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ABSTRACT 


la this Phase I study funded under the Small Business Innovation Research (SBIR) program, 
statistical methods are developed using the predictive inference and entropy approach. Previous 
recent research has derived entropy as the natural measure of model approarimation error from 
the fundamental s ta t istical principles of sufficiency and repeated sampling. Jq this study, the 
areas of nonnested multiple comparison, multivariable time series analysis, adaptive time series 
analyns of changing p r o c e ss es , and optimal small sample inference are investigated. Constrained 
maximum likelihood methods are developed for general nonnested multiple comparison. For the 
asymptotic optimality of these methods, a condition on the Fisher information and Hesrian 
matrices must be satisfied. Applying these results to multivariate time series analysis, lower 
bounds are derived for the achievable accuracy of the estimated transfer function and spectral 
matrices. Markov and canonical variate analysis (CVA) provide a means of numerically and sta¬ 
tistically stable model fitting of multivariable time series, and these methods provide a basis for 



modeling and fitting time varying models of changing ptocesses.jjkfethods are derived for the 
optimal selection of data length for fitting slowly changing processes^ well as for optimal selec¬ 
tion of the data interval for detection of abrupt changes. Optimal smak sample methods for mul¬ 
tivariate analysis are studied, and entropy methods are shown to provide significant improvements 
in very small samples. Recommendations for Phase 0 research and development focus on the 
adaptive and nonadaptive time series analysis procedures developed in this study. 
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L INTRODUCTION AND OVERVIEW 


la this study, statistics! methods are developed using predictive inference and entropy. This 
approach to statistical inference allows the treatment of several difficult statistical problems that 
are not easily dealt with using traditional statistical methods. The particular statistical problems 

ad dr e s se d are 

• statistical model building involving the determination of parametric 
model structure and order in the general case of multiple nonnested 
alternatives, 

• time series modeling and forecasting involving the determination of 
parametric model structure and order, 

a adaptive time series analysis involving optimal methods for tracking 
skm changes as well as for detecting abrupt changes or failures, 

a small sample inference for multivariate distributions of the exponen¬ 
tial family. 

A number of issues in these topics are resolved naturally in the predictive inference and entropy 
setting. This report provides an overview of the progr e ss of the Phase I research with detailed 
papers included in the Appendices. 

The recent interest in predictive distributions has come from several directions. Modem 
developments apparently begin with Jeffreys (1961, p.143) using a Bayesian approach as has much 
of the work following (Aitchison and Dunsmore, 1975, preface and pJ9). The frequentist 
viewpoint taken in this proposal has been stimulation by small sample problems (Murray, 1977, 
1979), model order and structure determination problems involving parametric models (Akaike, 
1973, 1974,), and nonnested multiple comparison problems (Larimore, 1977a, 1977b). Classical 
methods are conceptually ill-suited or perform poorly in practice on such problems. 

In the first Chapter, the approach using predictive inference and entropy is described. The 
basis of this ap p t nach is the derivation of entropy from the fundamental statistical principles of 
sufficiency and repeated sampling in the context of the predictive inference setup as first 
presented in Larimore (1983a). This provides a sound theoretical foundation that was previously 
lacking for the use of entropy as the natural measure of the error in approximating a true future 
density by an estimated predictive density based upon a present sample. The generality of this 
entropy measure allows comparison of statistical inference methods and the derivation of more 
optimal inference procedures including 







• general inference methods such as parametric or nonperametric 

methods 

• exact evaluation of small sample procedures 

• determination of model order or structure including the case of non* 
nested multiple comparison 

• aeries analysis including definition of optimal tracking of time 
varying p ro c es ses and optimal detection of abrupt changes. 

The generality of the predictive inference and entropy approach provides a basis for the generali¬ 
zation of the present statistical and predictive inference methods to more general statistical prob¬ 
lems. 

In Chapter 2, the multiple comparison of nonnested constrained models is developed. Previ¬ 
ous developments in multiple comparison have considered largely the nested case and assume that 
the true model is contained in one of the models. In the pr e s en t study, the case of constrained 
maximum likelihood estimation is considered where the true model may not be contained in any 
of the hypothesized models. The entropy measure provides a measure that allows the multiple 
comparison problem to be viewed as a model approximation problem. In this more general con¬ 
text the A1C procedure and generalizations of it are found to give asymptotically optimal predic¬ 
tive inference procedures as measured by entropy. In the nested case, these procedures reduce to 
the generalized likelihood ratio (GLR) test where the probability of rejection is a function of the 
number of additional parameters in the alternative model not contained in the null hypothesis. 

Time series analysis for nation ary processes are considered in Chapter 4. The entropy meas¬ 
ure provides a direct interpretation of the achievable accuracy in estimation of the power spec¬ 
trum of a process. The entropy is expressed as a squared relative error in estimating the spec¬ 
trum. A generalization of this to multiple time series relates to principle components of the pro¬ 
cess cross spectral matrix. A lower bound is determined such that the expected integral of the 
squared relative error in spectral estimation is bounded by the number of estimated parameters 
divided by twice the sample size. An example of spectral estimation of an ARMA(43) process 
using spectral —*«*w*»^g. Autoregressive modeling, and ARMA modeling shows the relative error 
in them estimation methods as dependent on the number of estimated parameters. 

The topics of Markovian time series or state space models provides an approach to time 
aeries analysis that is readily computable and is easily extended to the cam of changing pro ces s es . 
The general stats space model form is developed for Markov processes. The canonical variate 
analysis (CVA) method gives a direct and numerically stable computational method for determin¬ 
ing stats space models from observational data. The basic computational method is the general¬ 
ized singular value decomposition (SVD). This method allows for the direct determination of the 








o ptimal model state order without the computationally intensive fitting of such models for the 
evaluation of model fit. Once the model state order is determined, the state space model coeffi¬ 
cients are simply computed by regression. This method generalizes easily to changing processes. 

Adaptive time series methods are developed in Chapter 5. Primarily two types of changes 
are considered, slowly varying changes and abrupt changes such as faults. Time varying Markov 
p roc e ss e s are developed for such changing p roc e ss e s. Such proc e ss e s provide the hypothesized 
models for developing optimal tracking and detection of abrupt changes. An A1C based pro¬ 
cedure is derived tor the near optimal selection of the data length to use in model fitting. An 
example is given of estimating the spectrum of a time varying processes that gives results near the 
best previous solutions that are much more specialized. For abrupt change detection, a generali¬ 
zation of the AIC procedure is required since the comparison of models fitted on different data 
intervals is required which is not considered in the AIC formulation. Application of these 
methods to simulated data of abrupt changes in an ARMA(43) processes including jumps in the 
state, changes in the dynamics, and change in the variance of the excitation noise pro ces s es , 
demonstrates that the procedure is sensitive to the detection of these very different types of 
abrupt changes. 

Small sample multivariate inference procedures are described in Chapter 6. Since the 
entropy measure gives an exact rational measure of the relative error of statistical inference pro¬ 
cedures in small samples, it provides the bases for evaluation and development of small sample 
inference methods. The historical approach to predictive inference involves the derivation of a 
Bayesian predictive density. Although the method is Bayesian, in certain instances, the resultant 
predictive density has certain invariance properties which lead to an optimal predictive density in 
terms of the entropy measure. Another approach involves the direct solution for the optimal 
invariant predictive density minimisin g the entropy measure. This optimal invariant procedure 
leads to the same predictive density as the Bayesian predictive density using a nooinformative 
prior. These methods are compared for the multivariate normal distribution with the estimative 
and best normal estimation procedures. 





2. APPROACH USING PREDICTIVE INFERENCE AND ENTROPY 


The concepts of prediction and inference based on a set of data are very old and underlie 
much of the scientific method. While the scientific method has been much discussed in philo¬ 
sophical and qualitative terms, there has been very little in the literature from a basic statistical 
viewpoint. The moat extensive literature appears to be that associated with predictive densities or 
predictive distributions (see Aitchison and Dunsmore, 1975). That approach is to a large degree 
Bayesian, although more recent treatments have developed a purely frequency sampling interpre¬ 
tation in connection with use of entropy or Kullback information. The weak point in the fre¬ 
quency approach was the seemingly arbitrary use of the entropy measure of model approximation 
error. More recently, however, the result of Larimore (1983a) has established the fundamental 
nature of the entropy measure based upon the statistical principles of sufficiency and repeated 
sampling. The entropy measure has in addition a very natural interpretation as the log relative 
odds in comparing two predictive densities in predicting the future sample. This gives a central 
role to the entropy measure. The use of the entropy measure for decisions on model order and 
structure was pioneered by Akaike (1973), and has been applied to many diverse statistical prob¬ 
lems particularly in time series analysis. The justification given to the entropy measure in the 
Akaike approach, however, has been largely heuristic. Because of the importance of the justifi¬ 
cation of the entropy measure, the derivation and important concepts are outlined below in Sec¬ 
tion 2.1. The use of entropy in comparing model structure selection procedures and for exact 
small sample inference is discussed in Section 22. This approach to developing statistical pro¬ 
cedures using predictive inference and entropy is then applied to the various topics in the follow¬ 
ing chapters of the report. 


22 Derivation of the Entropy Measure of Apprarimatfan 

Predictive inference involves an experimental situation with two trials, an informative trial 
with observations x and a predictive trial with observations y. The joint distribution of the two 
trials is permitted any statistical dependence and is described by a joint probability density p(x j). 
The objective is to choose a predictive distribution or density which, for each possible observed x , 
is a probability density for the future outcome y. More precisely, consider a family p(y\ x,a) of 
predictive densities where the index a specifies a particular predictive distribution. For a particu¬ 
lar choice of a, p(y\xjt) can be viewed as a conditional probability density of y given x for 
predicting the distribution of the future observation y given an observed value of x. The predic¬ 
tive inference problem involves selection of a criterion of fit for appraising the goodness of 
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approximation to the true conditional probability density p(y\ x) by the various predictive densi¬ 
ties p(y\ x,a) specified by different a. The choice of such a criterion of fit is the primary topic of 
this section. Negative entropy is derived as the natural measure of model approximation error for 
any predictive distribution. 

Consider a family C “ { pjy\ x), aiA} of predictive densities for approximating the true 
density p.(y\ x) of the predictive experiment y given the informative experiment x, where x and y 
are vectors of dimension K and L respectively with true joint density p.(x j). For the predictive 
inference problem, a relative measure of goodness of approximation of p,(y\ x) by the various 
pjy | x) is desired. To this end, a repeated sampling experiment is considered in which joint ran¬ 
dom samples (x ia y<) for <~1,...A, are drawn repeatedly from a population with density p.(xj). 

The probability density of the joint predictive experiments Y ~(y x . y N ) predicted by the a-th 

model using X - (x x . x M ) is 

PaO'iJo-np.fok) (2-1) 

<-i 

The probability density for Y can be considered as indexed by the pair (a,X). Statistical infer¬ 
ence is considered about the true density p.(Y\X) of Y from among the family of probability den¬ 
sities F - {p,(K| X), a€4} for a fixed X. 

To consider the essential statistical information about the future sample Y given by the 
predictive densities p^Y] X), the sufficiency of the likelihood function (Zacks, 1971, p. 61) is 
used. From this principle, any inferences about the family F drawn on the basis of the sample 
(T| X) follow from the observed values of the likelihood function p a (y| X) for aG4. The set of 
likelihood ratios formed from pairs of these likelihoods is also a sufficient statistic (Cox and Hink- 
ley, 1974, p. 20-1, see also p. 37-9 for a discussion of likelihood and sufficiency principles). 

For inference about the densities p x and p 2 > sll of the information is contained in the likeli¬ 
hood ratio 


, * Pi(y<l*«) 

" pm*,) 


which has the intuitive interpretation of the relative odds of observing the data Y of the repeated 
predictive trials from each of the distributions p x and p x given fixed informative data X. The 
behavior of A* as the number of repetitions becomes large is most easily seen by expressing it as 


J2 S '<***<*,*0 - Urn i 













- Slp-to*) log 


Pi(y|*) 

P2(>l*) 


dx dy 


- //p«(y| *) lo g ^ [ ~J dy p.(x) dx 


(2-3) 


For a large number of repetitions, the odds will overwhelmingly favor p x or if the limit is 
respectively strictly positive of strictly negative. The preference for one distribution over the 
other as expressed by the likelihood ratio tends to grow exponentially with the number N of 
repeated trials. If (2-3) is zero, then there will be no consistent preference with large numbers of 
trials. Although for a finite number N of repetitions the likelihood ratio A* depends upon the 
particular samples (X,r), asymptotically for large numbers of repetitions this dependence disap- 


The direct pairwise comparison of predictive densities is not necessary if the Kullback- 
Leibter conditional discrimination information (Kullback and Leibler, 1931; Kullback, 1939, p. 13) 

x(P* - J>.(yl x) log gg j ^ dy (2-4) 

of p a relative to p, is used which is a function of x. Note that the order of p a and p. are not 
interchangeable with the latter playing the role of the truth. 

The likelihood ratio (2-3) is expressed in terms of the Kullback information as 

ton log A* - E x {I^ x (p, x (p.fid) (2-5) 

where E x denotes expectation with respect to the true density p.(x). The criterion is thus deter¬ 
mined as the negative entropy, or negentropy for brevity, defined as 

*(P*fJ * E x I^ x (p,fiJ - fp.(x)dx fp.(y\ x) log ^ dy (2-6) 

the expected Kullback conditional information of the predictive density relative to the true condi¬ 
tional density p,(y\x). In the repeated sampling experiment, the predictive density with the 
smaller negentropy relative to the true is ultimately preferred. The negentropy (2-6) thus orders 
the goodness of a set of predictive densities in approximating the true density. Also in comparing 
any two predictive densities p x and p x , the respective difference has the intuitive interpretation as 
the exponential rate at which the likelihood ratio diverges. 

The above derivation of the entropy measure of approximation of a predictive density uses 
only the predictive inference setup in the repeated sampling context along with the principle of 




j 



sufficiency. The sufficiency principle is one of the few generally accepted principles in statistical 
inference. Various repeated sampling principles have been formulated, however the difficulty has 
been the choice of an evaluation criterion for comparing various sampling distributions. The 
entropy measure gives a criterion that is based upon basic statistical principles of inference. 

2 J The Use of Entropy in Statistical Inference 

The entropy measure of error in approximating a predictive density is very general and can 
be applied in diverse modeling problems. In this section, some of the general model selection 
problems are described which include the nonnested multiple comparison problem, adaptive time 
series analysis of changing processes, and optimal small multivariate methods. 

From the derivation of the entropy measure, it can be seen that the entropy measure has a 
number of .very attractive features: 

• It applies to completely general modeling problems including non- 
parametric methods. 

• It applies exactly to small samples. 

• Only the fundamental statistical principles of sufficiency and 
repeated sampling are used. 

• It applies to time correlated problems such as time series model 
identification and tracking. 

• Statistical inference can be fundamentally viewed as model approxi¬ 
mation. 

Note also that the predictive distribution can include an entire model structure- 
determination/parameter-estimation scheme by setting 

P* 6i *) m p(y >0* (j )(*)) (2-7) 

where for every x, k(x) is the * minimizing a model structure determination criterion. Thus for 
each a, p a can be regarded as a model fitting procedure including the choice k(x) of model struc¬ 
ture. 

The negentropy measure is entirely applicable to exact small sample inference, system iden¬ 
tification, and detection of abrupt changes that include decisions among a multitude of 
parametric model structures which may be nonnested. Several predictive inference problems 
have been considered in the literature. Previous work has used the negative entropy measure in 
much more restricted formulations where 
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I a the informative and predictive samples were assumed to be indepen¬ 

dent (Aitchiaon and Dunsmore (1975), Akaike (1973)) which does 
not include the time series forecasting problem 

• the use of the negative entropy measure was considered as arbitrary 
(Aitchiaon (1975), Murray 1979)) or justified only asymptotically for 
large samples by heuristic arguments (Akaike (1973)) 

• the negative entropy measure was justified as a basis for comparing 
distinct parametric model structures (Akaike (1973)), but not for 
comparing model selection procedures which include choice of the 
model structure (Larimore, 1983a) 

• only the case of nested structures such as autor e gressive models 
were justified (Akaike (1973)) although there has been wide spread 
! application of it to the general nonnested case such as ARMA 

models 

• the previous literature on the use of information theory in statistical 
inference justifies its use by arguments of information transmission, 

I a set of postulates supposed to be obvious, or by analogy with 

entropy in statistical mechanics none of which are convincing from 
the point of view of statistical inference (Kendall (1973), Hart 
(1971)). 

Thus the results of Larimore (1983a) give a solid theoretical justification for the use of the nega¬ 
tive entropy measure in a general setting which makes possible the further general development 
I of predictive inference statistical methods. 

For the parametric case, the very general considerations above simplify somewhat. For the 
structure determination problems, an estimator of the form as in (2-7) associates a parameter esti¬ 
mate 8(a) with each poanble value a of the sample apace. To simplify the discussion in this sec- 
f tion, we can consider that the informative experiment x (fit set) and predictive experiment y 

(check set) are independently and identically distributed K-dimensiooal vectors. In Section 4, the 
general dependent time aeries aaalyms case will be discuamd. We predict the density of y by 

\ p(y,t(x)) where p(y,4) is the parmmeteriatd class of densities for y. The negative entropy meas- 

) _ 

' ure (2-6) reduces in the parametric csss to that snggs start by Akaike (1973) and is expressible as 

«<p.j) -£.*<,.*) *t (2-8) 

where * d e not es the true value at the parameter • and where E M denotes expectation with 
r espect to the informative sample < Fhat the estimator •(*) may involve different model orders 
or structures as in (2-7) is ao conceptual difficulty although it may complicate the evaluation of 
the na g a at ropy (2-4). The predictive temple , icheck set) is never actually drawn, but we wish to 
t devise de ci rtoa procedures which would optimally predict in terms of (2-4). 








The major problem is to devise model-estimation/structure-determination schemes 

which come cioce to minimising the negentropy. A major step in that direction was made by 
Akaike (1973) in proposing an extension of the maximum likelihood method to compare different 
model orders or structures. Suppose that § 4 (x) is the m axi m um likelihood estimator for a given 
restriction of the parameters 9 to a subspace H k that is defined for every x in the sample space. 
Then we wish to partition the sample space into the disjoint subsets X k so that for xQC, the esti¬ 
mator 

0*0 = 0*00 for x(X k (2-9) 

is used. Akaike (1973) shows that asymptotically for nested models, an unbiased estimate of the 
negentropy using the maximum likelihood model §*(*) for the whole sample space x(X is given 
by the Akaike information criterion (A1C) defined by 

AIC(k) = -21np(x,0 t (x)) + 2 K(k) (2-10) 

where K(k) is the dimension of H k , i.e., the number of parameters estimated. The Minimum 
A1C Estimate (MAJCE) proposed by Akaike (1973,1974b) is to partition the sample space so that 
X k is the set of sample points for which 

AIC(k) < A/CO) for j = k (2-11) 

Then the MA1CE estimate is 

Wi(*) = 0*C0 for x(X k (2-12) 

so that on the set X k , 6 UAJCE (x) is the maximum likelihood estimate 9 t (r) corresponding to the 
model structure k with minimum A1C. 

For autoregressive models, Shibata (1981a) has studied the MAJCE and other asymptotically 
equivalent procedures for model-estimation/order-determination. He adopted a spectral measure 
of accuracy that is asymptotically equivalent to the negentropy. He showed that asymptotically 
for large sample, MAJCE minimi*** the negentropy measure of accuracy (2-8), which will be 
called entropy efficiency. Hence MAJCE is asymptotically an optimal procedure for choosing 
autor eg r es s i ve models. Shibata (1981b) also shows MAJCE as asymptotically optimal for regres¬ 
sion problems which involve nonnested multiple comparisons. Other procedures for model order 
determination have been proposed (Bhansali and Downham, 1977; Schwarz, 1978) which 
emphasize the choice of true model order asymptotically for large samples, which is called order 
consistency. In most real problems the true order is infinite, and even if such a fiction were to 
exist, a predictive criterion is much more intuitive in most applications. Shibata (1983) has shown 
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that order consistency and entropy efficiency are mutually exclusive so that a choice is required 
as to which of these criteria is most important. In particular, Shabata has shown that an order 
consistent procedure cannot be entropy efficient, and that an entropy efficient procedure will not 
be order consistent. 

Turing now to a different general problem, that of exact small sample inference, predictive 
inference and entropy provides a new approach to the problem. Past approaches to the small 
samp tference problem have involved a number of ad hoc procedures. The entropy measure 
provides a sound fundamental measure of the approximation error in predicting the density of the 
future experiment. One of the pest approaches has involved the tstiirutive method where the 
predictive density is restricted to lie in the class of densities a—nn »H to contain the true. Recent 
results have shown that the use of more general predictive densities can give more optimal results 
as measured in terms of the entropy measure of model approximation error (Murray, 1979, 1977; 
Aitchison, 1975). The optimal predictive density has been derived in the class of invariant densi¬ 
ties minimizing the entropy measure. This was derived before the ft un fitting of the entropy 
measure based upon the sufficiency principle. As shown in Chapter 6, this more general and 
optimal predictive density can be considerably better as measured by the entropy. 

In time aeries analysis, the advantage of the approach using predictive inference and entropy 
is that it provides a sound theoretical framework in terms of model approximation for the direct 
comparison of very general time series analysis models including: 

• Consideration of many complex hypotheses 

• Comparison of nonnested hypotheses 

• Comparison of dynamic models of different dynamic (state) orders 

• Consideration of models fitted over different data sets for detecting 
abrupt changes 

• Consideration of different adaptation rates for doing optimal model 
tracking 

The comparison of such diverse models is inherent in adaptive time series analysis and abrupt 
change detection, and previous investigations have not had available such a sound and general 
framework for solving these difficult problems. 








3. CONTAINED NONNESTED MULTIPLE COMPARISON OF MODELS 


la this chapter, the general problem of nonnested multiple comparison is considered. In 
order to develop a general theory, consideration is restricted to comparing models that are the 
result of «w*»1wiuh» likelihood estimation. The objective of the discussion is to gen* 

eralias the currently available procedures for nonnested multiple comparison in the constrained 
wnwimimi likelihood cont en t. 

The approach is to view the fitting of each alternative parametric model form as an approxi¬ 
mation procedure, which includes the notion that the true model is in general not contained in 
the class of parametric models considered in the model fitting. This is a departure from previous 
approaches that involve primarily asymptotic arguments where the parametric models approach 
the true mode! as the —size becomes large. Such an asymptotic argument begs the question 
of model approximation since asymptotically there is no error in the approximation. It is very 
important in practice to determine the extent to which the asymptotic approximations are accu¬ 
rate in moderate or small samples. 

Another area of weakness in available approaches is the assumption of nesting in comparing 
models using entropy methods. The derivations of Akaike involve the assumption of nested 
models which considerably restricts the applications of the methods. In practice, the A1C cri¬ 
terion has been applied in a much wider context than the comparison of nested models. 

The results of this chapter will provide the basis of much more general decision procedures 
for adaptive time series analysis. In these problems, the comparison of different models based 
upon different intervals of data are compared to determine if an abrupt change has occurred. 
Previous entropy methods have only compared different models for a given interval of data. 

34 Cenatadnsd llsilmam faffceed Fnimsriia 

The first result to be discussed is the generalization of the usual maximum likelihood theory 
to the constrained case. The regular case is considered where the log likelihood function is 
ex p andable in a Taylor series (Cox and Hinkley, 1974, p. 281). These conditions permit the inter¬ 
change of expectation and differentiation. 

Following the notation of Larimore (1986d, contained in Appendix A), let /(*,•) denote the 
log likelihood function of the informative sample x considered as a function of the parameters 0. 
Denote by £ the expectation with respect to the true density p (x ,9) with true parameter 0. The 
negative entropy measure is used as the measure of approximation to the true density by an 






approximating density p(x,0*) with parameter value 0* from the subspace 8* of parameters. The 
projection t* of • onto the subapace 0 4 is defined as the parameter value 0* tninimfamg 

*,(•.•*) - «(* i) - El{x ,0*) (3-1) 

so that the projection 0* satisfies the condition 

El (x&)- 0, (3-2) 

where ' denotes the derivative with respect to 0* . The minimum is unique if and only if the Hev 
■an 0* given by D% - E/"(x,0*) is positive definite. Thus for a constrained class of models, the 
projection of the true parameter value defines the best approximation to the true density in the 
clam of approximating densities. 

Consider now the constrained maximum likelihood estimate 0* in the subspace of parame¬ 
ters 8) satisfying the likelihood equation 

r(xM)-o ( 3 . 3 ) 

Then under the regularity conditions, we have tot a positive definite Hessian D? and asymptoti¬ 
cally for large informative sample x that 

• 0* is an unbiased estimator of 0* 

e the estimation error covariance matrix is 

- (.Dfr'E{r T (xM)lXx#)KDfr' (S4) 

For the unconstrained case, the middle term is the Fisher information matrix and is equal to 
minus the Hessian D^, but in the general constrained case this is not true. 

Now consider the likelihood /(yj x ,0) of the predictive experiment y conditioned on the 
informative experiment x. From the above results, the negentropy can be easily determined. 
Expanding the log likelihood to second order and taking expectation gives the negative entropy as 

**.(•■•*) - - jll 4*-i'll + **,(«,«*) (3-5) 

which holds asymptotically for large informative sample. Note that the second term is exact with 
no approximation in small samples. The first term is an approximation involving the variation of 
the maximum likelihood estimate locally about the projection 0*. Thus the bias pan of the error 
in constraining the model is captured exactly. 
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3J Uhbtasd Trtairtu ef fistrepy 

In the previous discussion of entropy, the measure is considered as a measure of approxima¬ 
tion error between the true and approximating density. In practice, the true density is unknown 
and it is desired to obtain an estimate of the negative entropy based upon the observed informa¬ 
tive sample. To simplify the discussion, the case of x and y independent is considered. An accu¬ 
rate estimate of the negative entropy was first obtained by Akaike using the log likelihood as an 
estimate of the entropy with a correction for the bias. The Akaike information criterion (A1C) 

AIC(k) - -21ogp(x,**(x)) 4- 2X(k) (3-6) 

was derived as an unbiased estimate of the entropy where K(k) is the number of parameters 
adjusted in fitting the maximum likelihood estimates. The second term adjusts for the bias in 
estimating the entropy using the informative sample and adjusting the parameters in fitting. 
Akaike (1973) originally derived the A1C as an unbiased estimate for the relative comparison of 
the prediction error in comparing two nested models. The nesting is also important in that 
derivation because the models are not only nested but asymptotically approach the true model. 

In the more general case of constrained maximum likelihood estimation, a difficulty occurs 
in the estimation of the negentropy. Consider as above the case of x and y independent and 
identically distributed. As derived in Appendix A, the expected log likelihood difference of the 
informative sample is 

£[/(x,9) - /(x,9*)J - T (*,?)l (x,0»)} + * r (9,9*) (3-7) 

In the unconstrained case, asymptotically for large informative sample the trace term is equal to 
the number of parameters estimated. Unfortunately in the general constrained case, the Hessian 
is not equal to the Fisher information matrix. The trace then depends upon the expectation with 
respect to the true unknown density of the first and second derivatives of the log likelihood func¬ 
tion at the projection 9*. This cannot be computed in general since the true parameter is unk¬ 
nown. 

In cases where the Hessian and Fisher information are equal so that the trace is equal to the 
number of parameters, then two different parametric model structures, say 9* and 9* can be com¬ 
pared using (3-7) as 

£(/(x,«*) - /(xiOl - * <fim(9*) - + R(9,9>) - *(9,9*) (3-«) 

which is equivalent to the A1C. The denvatioo for constrained maximum likelihood estimates 
makes dear some of the assumptions in previous entropy methods. The major difficulty is caused 
by the variation of the Fisher information or Hessian as the parameter values change. In the 
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derivation above, tha iaua of rating doM not aria. 


UNMMi 

In tha caa of seated teats, the MAJCE criterion reduces to the usual generalized likelihood 
ratio (GLR) teat. The threshold and resulting probability of rejecting the null hypothesis, ie. the 
sia of the test, depends upon the number of additional parameters in the more general model. 

la compering two hypotheses H 0 and H x , the AIC criterion is to choose according to the 
sign of da quantity 


AIC(Hf) -AIC (HO « -21og*kf£ + 2[K (0) - *(1)] (W) 

P(*.r) 

The AIC criterion in the nested caa is equivalent to the decision rule 
choose H 0 if -21ogX < 2(JC (1) - JT(0)] 

chaose H j if -2k)gX 2 2{4T(1) - JT(0>] (3-10) 


when the generalised likelihood ratio X is defined by 


X 


Pit?) 


0 - 11 ) 


The threshold 2[X(1) - JT(0)] is precialy twice the number of additional parameters under the 


hypothesis H v 


In the caa of a normal class of densities, the sia a of the test is easily determined since the 
GLR statistic X is chi-aquared on K( 1) - JT(0) degrees of freedom under the null hypothesis H 0 . 
Thus a is given a the solution of the relationship 


Xl* - 2* (3-12) 

wen « is the number of additional parameters and is the n probability point of the cumula¬ 
tive chi-squared distribution on n degrees of freedom. Solving for a a a function of n gives the 
sia of the test a a function of the number a of additional parameters a shown in Table 1. Note 
tha the traditional a levels of 0.10, 0.05, 0.01, 0.005 and 0001 corr es pond respectively to about 4, 
8,16,20, and 30 additi o nal parameters It ha been known for a long time tha in composite tests 


change in the sia a of the test with the number of additional estimated parameters. 
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Table 1. Dependence of Significance Level a on the Number a of Additional Panmeten 












4. TIME SERIES ANALYSIS USING ENTROPY METHODS 


Methods of predictive inference and entropy offer a number of advantages in the analysis of 
time aeries not available in other methods. In this chapter the basic time series analysis methods 
are described, while in the following chapter adaptive methods for time series analysis are 
developed using predictive inference and entropy. Fust the topic of the achievable accuracy of 
spectral analysis is addressed by relating the entropy measure directly to a relative squared error 
in estimating the power spectrum. Following this is a discussion of the Markovian representation 
of time series in terms of state space models which will be very useful in representing time vary¬ 
ing models of time series. The canonical variate analysis approach to time series is then described 
which forms the basis of the adaptive time series analysis methods developed in the following 
chapter. 

4 J Achievable Sp ec t r al Accuracy 

In this section, the informative and predictive samples will be denoted by u and v respec¬ 
tively to allow for the traditional use of x and y for random processes. Consider the problem of 
identifying a model for a pair (x,j,) of multiple stationary time series where x, and y, are exo¬ 
genous and endogenous time series respectively. Consider linear stochastic models in the form of 
a linear difference equation 

- <7, + r, (4-1) 

H) 

where *(/;•) is a causal linear system giving the response in y, to the past exogenous inputs x,, 
and where q, is white noise. Suppose that the probability density of the process is parameterized 
by §. The exo g enous variable x, will be considered as exactly observed, and the problem of 
modeling y, con diti o nal on x, is considered so that prediction of y, from x, based upon such a 
m odel is the princip le problem. This also includes the problem of no exogenous variable so that 
only y, is observed. 

We wfeh to investigate the achievable accuracy in estimating a model for the process. In 
particular, the entropy measure will be developed to obtain the relationship between the number 
of pa ra m eters e sti ma t ed in the model sad the relative squared error in estimating the power spec¬ 
trum. An examp l e illustrates the effect on the spectral estimation error due to the particular 
dam of pa ram e tric models used in the identification and the number of parameters estimated. 
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The predictive inference setup of Chapter 2 is considered where the primary interest is in 
the a sy mp totic behavior with large sample size of both the informative and predictive samples. 
Consider an obs e rved informative sample M r *Cxf jj. • • ■ •xldl) of size N used to estimate the 
process model, and similarly consider a conceptual predictive sample v of size hi used to evaluate 
the accuracy of the estimated model. The predictive sample is assumed to be identically distri¬ 
buted but independent of the informative sample. Consider the problem of inference on the 
parametric class {?(*,#),•&} of models with probability densities p(v ,ft) based upon the informa¬ 
tive sample a. Consider the conceptual repeated sampling experiment where on each trial the 
samples « and v are each drawn independently from the process 5 (w.d) with • assum e d to be the 
true parameter value. An estimative model ^-p(v ,&(*)) is chosen for the density of v by some 
parameter es t i m a t i on scheme 6(a). For a stationary process, the negative entropy (2-6) is linear in 
the predictive sample sis M , so it is more useful to consider the per sample ne gen txopy. To this 
end, define the p*r sample negemtropy denoted /(p«4). As derived in Appendix B, the I- 
divergence is given asymptotically by 

HSJ) - Um-j-E. / I°t d, 

p(v,e(a)) 

- 7 *. /*<*>#,,(.> - *«<•>»’£ 

+ I e. - rf(-)ls,(-X«(-) - <»(«) T)£ (4-2) 

where E m denotes expectation relative to the informative sample a. 

In the multiple time series case, the spectral measure (4-2) has an intuitive interpretation in 
terms of principal components of the power spectrum in the frequency domain. Principle com¬ 
ponent representations of the spectral matrices S„(w) and 5 a (w) have the form 

/(•) J„(«-W(«) - D(«) , L(») 5„(«) £.*(«) - £(«) (4-3) 

when J(m) and L(m) given as a function of frequency « are unitary matrix transformatioos so 
J (a)/'(e)W-i(«)L‘(a) which diagonals S„(«) and J w (w) respectively and where D(w) and 
£(«•) are d iag on al matricies. Filtering x(r) with transfer function L(m) gives the principal com¬ 
ponent process j(t) which is expressed in the frequency domain as X(w)-Z.(w)X(w) , and which 
has the diagonal spectral matrix £(«). and similarly for ?(/). 

The spectral measure (4-2) is shown in Appendix B to be 
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The first sum on the right hand side is the integrated squared relative error of the estimated cos¬ 
pectra of the principal components, while the second term is the integrated squared coherency of 
the estimated qpectrum D(m) which would be aero if D(«)-D(w). Thus the measure (4-2) has a 
dear interpretation in the multivariate case when the true spectrum O(w) is diagonal but where 
the appradmatiag spectrum O(w) is permitted arbitrary coherency among components. The third 
term in the spectral measure (4-4) is asymptotically equivalent to replacing £„(«) by S„(«). This 
term is invariant to the unitary transformations /(«*) and L(m) where G («)-/( M W(«V>*(<*) is the 
transfer function H(m) expressed in the coordinate frame of the principal component series x(r) 
and y(t). The squared magnitude error | G tJ ( w) - G jy («)j 2 in the ij element of the transfer func¬ 
tion b weighted by the input signal to output noise ratio D («)*/£ (w) ;y for the 

The spectral measure of accuracy can be bounded in terms of the number of parameters 
estimated. Suppose first for simplicity that the parametric class of modeb contains the true pro¬ 
cess and that k parameters are estimated. Then by Appendix B using the Cramer-Rao lower 
bound, the per sample negentropy b bounded by 


E m l(Sj ) - Em 2^(6 - 9fF(t - 6 ) * ± (4-5) 

with equality achieved asymptotically for large informative sample N. Thu implies the bound on 
the achievable accuracy in spectral estimation given by 


+ 7 *. (-) - «(-)ls D (-X«(-) - (*-<>) 

In the more general case where the order is infinite and the MA1CE procedure for choosing j 

model order k b used, then Shi beta (1983) has shown the following result. For each informative 
sample size N there b an optimal order k'(N) which minimizes the tradeoff between variability 
and bias in the entropy measure 

*(W> - -yll •*-4*11 l, + *<•■•*) - ~ *(*.•*) (4-7) | 





Then asymptotically for large informative sample N, the negative entropy of the MA1CE model 
selection procedure is exactly that of the model selection procedure using a fixed number of 
parameters equal to the optimal order **. Thus even in the case of using an entropy efficient 
model selection procedure where the true model order is infinite, the achievable accuracy of 
spectral estimation (4-6) can be bounded by the function (4-7) of the sample size N with k -4*. 

To illustrate the use of the lower bound on the achievable accuracy, consider the ARMA 
(43) process 


y, - 13136 y,. x - 1.4401 y,_ 2 + 1.0919 y,_ s - 033527 y,^ 

+ w, + 0.17921 w,_, + 032020 w,_ 2 + 026764 w,. y (43) 

with the ndise variance of w as Q - 1.7258xl0~ 2 . This process was analyzed by Gersch and Sharp 
(1973) and Akaike (1974b) to show the increased accuracy of ARMA models over AR models. 
For a sample size of 800, the optimal order was found to correspond to 4* - 18 for fitting an AR 
model to the data. Akaike (1974b) fitted several models to simulated data using AR, ARMA, 
and Hanning window methods. Figure 1 shows the variability term of the spectral error as a 
function of frequency for the various model fitting procedures. Since the optimal order was used 
for the AR model, the bias is also included. The ARMA model has no bias smc*. the full order is 
chosen with high probability. On the ocher hand, the Hanning window has significant bias since a 
fixed bandwidth is used to spectrally smooch the data at all frequencies, and this bandwidth is not 
sufficient to estimate the sharp peak without bias. Use of a wider bandwidth would increase the 
already large error at all the ocher frequencies. The AR and ARMA methods are clearly adap¬ 
tive in that the methods have a greater bandwidth to accommodate the rapid changes (large 
second derivative or curvature) near the peaks and troughs but lower bandwidth in regions with 
low curvature. The greater parametric efficiency of the ARMA method is clearly depicted in 
these regions of low curvature. Repeated simulation of the time series data from the ARMA(43) 
model and the maximum likelihood estimation of ARMA models with MA1CE confirms that the 
lower bound indeed gives an accurate description of the spectral estimation error in practice (Lar- 
imore, Mahmood, and Mehra, 1984). Independent methods have been developed for obtaining 
simultaneous confidence bands on spectral estimates (Larimore, 1986c). Such confidence bands 
are proportional to the integrand of the spectral measure of accuracy so that the integrand gives 
an accurate measure of spectral error at each frequency as well. 
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Figure 1. Spectral Accuracy of Different Model Fitting Procedures (lower curves) in Approxi¬ 
mating the True Spectral Density (upper curve). 

4 J Markovian Mod* of Time Scrim 


In this section Markovian or state space models of time series are reviewed. Such models 
have not been widely used in time series analysts, although there is wide spread use of such 
models in filtering and prediction with numerous applications in engineering. State space models 
have a number of advantages in time series analysis that are attractive for automatic implementa¬ 
tion on m fcro pro cem ors using the canonical variate analysis method discussed in the next section. 
Such procedures allow the automatic selection of model state order using entropy methods and 
lend themselves to adaptive methods for time varying p ro cesms discusmd in the next Chapter. 


The starting point of any approach is the joint probability distribution of the past and future 
observations ,4) where p, are the past inputs u, and outputs y, up to time t and/, are the 
outputs y, in the future at time t defined by 
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pf~( O'M mJL i O'iW) - f! = O', +l 0'/+2 . • ) (4-9) 

and 8 is a vector of parameters indexing the model. A fundamental property of a Markov process 
of finite sate order is the existence of a finite dimensional sate x, which is a linear function 

x, - Cp, (4-10) 

of the past p,. The sate x, has the property that the distribution of the future f, conditioned on 
the past p t is identical to that of the future /, conditioned on the finite dimensional sate x, so 

P(f,\p,fi) =p(f,\x,fi) (4-11) 

Thus, only a finite amount of information from the past is relevant to the future evolution of the 
process. 

A stationary Markov process of some particular sate order can be represented by a vector 
difference equation with the general form (Lindquist and Pavon, 1981) 

x , +1 = 4>x, + Gu, + w, (4-12) 

y, = Hx, +Au t + Bw, + v, (4-13) 

where u is an input vector process, y is the output vector, x is the sate vector, and w and v are 
white noise processes that are independent with covariance matrices Q and R respectively. The ! 

matrices «h, A, B,G, and H determine the dynamics of the process and correlational characteris¬ 

tics of the disturbances. The various matrices are considered as functions of the parameters 8 
specifying the process. The white noise processes model the covariance structure of the error in 
predicting y from u . 

For time series analysis and system identification, the parameterization of the model is an j 

important issue. The elements of all of the matrices of the sate space model (4-12) and (4-13) 
and noise covariances are not independent parameters of the model. In fact for each distinct pro¬ 
bability distribution there is an equivalence class of models of the form (4-12) and (4-13) with the \ 

same distribution. It can be shown (Candy, Bullock, and Warren, 1979) that the number of I 

independent parameters is 

K(k) =2kn («-t-1) 2 +km (4-14) | 

where k,n, and m are the vector dimensions of the sate x,, outputs y,, and inputs u, respectively. 

If there is no instantaneous feedforward so A =0, then the term nm is deleted, while if there is no i 

input so A K7-0 the terms km +nm are deleted. •, 

) 

i 

i 
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The fp tf* parameterization (4-12) and (4-13) is not unique, however in the next section 
i well conditioned procedure for selecting * unique model for the equivalence class will be 
described. For an individual state space model there exists a corr e sponding ARMA model and 
visa vena. However the two classes are not equivalent as classes. In general there is no ARMA 
class of models equivalent of a particular state space class. The ARMA class has one major diffi¬ 
culty • there is no global parameterization of the state space models of a given order. The diffi¬ 
culty is in the ARMA representation which becomes singular at certain models such as one 
involving the cancellation of a pole and a zero. This causes great difficulty in numerical methods 
in attempting to automatically identify higher order models which may involve such cancellations 
of poles and zeros. 

The major advantage of the state space models is the availability of efficient and numeri¬ 
cally well conditioned procedures for model identification discussed in the next section, and the 
explicit Markov structure allows for the the development of direct adaptation procedures 
developed in the next chapter. 

43 Canonical Variate Analysis of Time Scries 

The canonical variate analysis method for identification of state space time series models is 
described in this section. The methods for the determination of the state order and selection of 
the state using concepts of canonical variate analysis are first discussed. The determination of the 
state space model is then computed by simple regression. The computation involves primarily a 
singular value decomposition of the sample covariance matrix of the process. 

A generalization of the canonical variate analysis method has recently provided a completely 
general solution to the static reduced rank stochastic prediction problem which is well defined 
statistically and computationally even when some or all of the various covariance matrices are 
singular (Larimore, 1986b). All other previous methods in the statistical literature do not address 
the general problem. This result is the foundation of the time series analysis methods using 
predictive inference and entropy including the adaptive time series methods. 

The original development of the canonical correlation analysis method of mathematical 
statistics was by Hotelling (1936; see also Anderson, 1958). The application of canonical variate 
analysis to stochastic realization theory and system identification was done in the pioneering work 
of Akaike (1974a, 1975, 1976). This initial work has a number of limitations such as no system 
inputs, no additive measurement noise, substantial computational burden involving numerous 
SVD's, a heuristic set of decisions for choosing a basis for representation of the system, and a 
number of approximations including computation of the A1C criterion for decision on model 
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order. 

Some important generalizations and improvements in Akaike's canonical correlation method 
have recently been made by Larimore (1983b). These include generalization to systems with addi¬ 
tive measurement noise and with inputs including feedback controls. A major departure of the 
approach from previous work is the use of a single canonical variate analysis to optimally choose 
k linear combinations of the past for prediction of the future. The very natural measure of qua- 
dratically weighted prediction errors at pomibly all future time steps is used. Formulated as such 
a prediction problem, it is shown how a generalized canonical variate analysis gives the solution 
explicitly. The interpretation of canonical variates as optimal predictors is central in motiv ating 
interest in such a problem formulation and is scarcely found in the statistical literature (Larimore, 
1986b). The optimal k -order predictors are not in general recursively computable, but the 
optimal state-space structure for approximating them is expreaed simply in terms of the canonical 
variate analysis. The problem of finding an optimal Hankel norm reduced order model (Adam- 
jan et al, 1971; Kung and Lin, 1981) is related to the canonical variate approach (Camuto and 
Menga, 1982; Larimore, 1983b). The balanced realization method is a particular case of the gen¬ 
eralized canonical variate analysis (Desai and Pal, 1984). 

To more concisely discuss the canonical variate method, the results in Larimore (1983b, 
1986b) are briefly reviewed. Consider the problem of choosing an optimal system or model of 
specified order for use in predicting the future evolution of the process. As in Section 42, con¬ 
sider the past p, of the inputs u, and outputs y, before time t and the future of the outputs y, at 
time t or later so 

Pi * ( • ■ ‘ . f! » (ft+iOta, • ■) (4-15) 

We assume that the processes u, and y, are jointly stationary and denote the covariance matrices 
among/, and p, , and V 

The major interest is in determining a specified number k of linear combinations of the past 
p, which allow optimal prediction of the future /,. The set of * linear combinations of the past 
p, are denoted as a 4x1 vector m, and are considered as 4-order memory of the past. The 
optimal linear prediction /, of the future /,, which is a function of a reduced order memory m ,, 
is measured in terms of the prediction error 

E{\\f, -/, || l } ) = E{(f ,-/,)> (4-16) 

where £ is the expectation operation and t denotes the pseudoinverse of a matrix. The optimal 
prediction problem is to determine an optimal k -order memory 
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«•» “ hP, (4-17) 

by choodng the k rows of J k such that the optimal linear predictor based on n, minimizes 
the prediction error. 

As derived in Larimore (1986b), the solution to this problem in the completely general case 
where the matrices 2'. f/ ,2 rr , and 2 ff may be singular is given by the generalized singular value 
decomposition as stated in the following theorem. 

Thmcsm 1. Consider the problem of choosing k linear combinations m, - J k p, of p, for 
predicting /, such that (4-16) is minimized where 2^ and 2 ff are pomibty singular positive sem- 
idefinite symmetric matrices with ranks m and « respectively. Then the existence and uniqueness 
of solutions are completely characterized by the (2^ JL ff >generalized singular value decomposi¬ 
tion which guarantees the existence of matrices J, L, and generalized singular values yj, • ■ • ,y T 
such that 

•'V' 1 ' -U • L2„L t =1; , J2, f L T -Dtagfri *...* 7 r ,0.0) (4-18) 

The solution is given by choosing the rows of J k as the first k rows of J if the k-th singular value 
satisfies y k > y kH . If there are r repeated singular values equal to y t , then there is an arbitrary 
selection from among the corr e sponding singular vectors, ix. rows of J . The minimum value is 

min || II l - (1 - 7?) + • • • + (1 - 7*) (4-19) 

rmk(J k Z m Jl) -1 *// 

This result not only gives a complete characterization of the solutions in selecting optimal 
predictors m* from the past p, tor prediction of the future /,, but the reduction in prediction 
error for all possible selections of order k is given simply in terms of the generalized singular 
values. This is of great importance since it avoids having to do a considerable amount of compu¬ 
tation to determine what selection of order is appropriate in a given problem. 

The generalized CVA method allows the determination of the fit of the various state space 
models and the selection of the best model Rate order before computation of the state space 
models. Consider the general case of identifying a state space model: given the pest of the 
related random process u, and y, , we wish to model and predict the future of y, by a k-order 
state-space structure of the form (4-12) and (4-13) 

x,., - -i-Gu,+w, (4-20) 


y, - Hx, +Am, +Bw, +v, 


(4-21) 










Ia the computational problem given finite data, the pest end future of the process are taken to be 
finite of length 4 lags so 

pJ - (y.-i. • * • jk*. • • ’ •“£*) • /« f * Of. ' •' O',*) (4-22) 

Akaike (1976) proposed choosing the number 4 of lags by least squares autoregresrive modeling 
using recursive least squares algorithms and choosing the number of lags as that minimiTiwg the 
A1C criterion d iscus s ed below. This insures that a sufficient number of lags are used to capture 
all of the s ta tis ticall y significant behavior in the data. This procedure is easily generalized to 
include the case with inputs a,. The generalized SVD of Theorem 1 determines a transformation 
J of the past that puts the state in a canonical form so that the memory m, - Jp, contains the 
states ordered in terms of their importance in modeling the process. The optimal memory for a 
given order k then c o rresponds to selection of the first k states. 

In order to decide on the model order to select, the Akaike information criterion (2-10) is 
computed where the number of parameters is determined from (4-16). Once the optimal k-order 
memory m, is determined, state-space equations of the form (4-12) and (4-13) for approximating 
the process evolution are easily computed by a simple multiple regression procedure (Larimore. 
1963b). 

Since the CVA system identification procedure involves the state space model form, it has 
the major advantage that the model is globally identifiable so that the method is statistically well 
conditioned in contrast to ARMA modeling methods (Gevers and Wertz, 1982). Furthermore, 
since the computations are primarily a SVD, they are numerically stable and accurate with an 
upper bound on the required computations (Golub, 1969). Thus the method is completely reliable. 
It has been demonstrated as such in the time series analysis software Forecast Master that is com¬ 
mercially sold by SSI. From the theory of the CVA method (Larimore, Mahmood and Mehra, 
1964), it can be shown that there are no difficulties such as biased estimates caused by the pres¬ 
ence of a correlated feedback signal. The CVA method was demonstrated in real time identifica¬ 
tion and adaptive control of unstable aeroelastk wing flutter on a scale model F-16 aircraft in the 
NASA Langley Transonic Dynamics Wind Tunnel in February 1986. 










5. ADAPTIVE TIME SERIES ANALYSIS 


The state space model identification methods are developed in this chapter (or adaptive time 
series analysis. The concepts of a changing Markov process are first discussed along with con¬ 
cepts of a piecewise constant model of the process that is constant over intervals of time. The 
approach to adaptation to slow changes using predictive inference and entropy is described. This 
leads to a model fitting criterion for choosing an optimal data interval that balances the decreas¬ 
ing —»p«"g variability with increasing sample sins against the increasing misamodeiing error due 
to use of a constant model over an interval of data. In fitting models involving abrupt changes, 
the models fitted over various intervals are compared to determine if an abrupt change has 
occurred. This involves the comparison of models determined from dam on different data inter¬ 
vals in predicting the error on a different interval. Several examples are given in using the pro¬ 
cedure on changes involving the dynamics, noise excitation, measurement nous, and other 
changes. 

Concepts of adaptive systems have been around since the 1950's involving various senses of 
adaptation. The present literature on the subject includes a number of methods such as recursive 
computational schemes, exponential forgetting, lattice computational methods, etc., which have 
certain 'knobs' that allow tuning of the algorithm to accommodate changes in the characteristics 
of the actual pro ce ss es . Reviews of these and related methods are contained in several recent 
special issues of technical journals and books (Special lame on Adaptive Control, Automatica, 
Vol. 20, No. S, 1985; Special lame on Linear Adaptive Filtering, IEEE Trans, on Information 
Theory, Vol. 30, No. 2, 1964; Honig and MesKrschmitt, 1984). While these methods do permit 
some degree of adaptation to process changes, the methods of adaptation are ad hoc, and no 
sound underlying statistical principle for adaptation is propos ed or demonstrated. As might be 
expected, these methods can work poorly on certain cases because of the lack of a sound statisti¬ 
cal basis. 

In particular, the recursive prediction error and lattice methods are convenient due to their 
recursive form and provide an estimate at every observation (Friedlander, 1982a, 1982b, 1983; 
Ljung and Sode r tt r om, 1983). Also, the recursive algorithms can be used for adaptation by 
exponential weighting of the past data (Wellstead and Sanoff, 1981; Irving, 1979; Evans and Betz, 
1982). But the rational for exponential weighting has not been given a sound fundamental justifi¬ 
cation, but is used largely due to its ease of use. The choice of the exponential weight has been 
ad hoc and susceptible to misinterpretation of changing noise variance levels as time varying 
changes in the dynamics (Haggtund, 1983). 
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Adaptation to abrupt changes has been largely discussed in the fault detection literature. A 
comprehensive survey of fault detection methods is given by Wiliiky (1976). See also Mehra and 
Peschon (1971), Willksy and Jones (1974), Willsky (1980), and Isermann (1984). These methods 
have a number of short warnings in detecting changes in dynamics, computational illcondition- 
ing, and excessive computational burden. 

The central computation of any adaptive algorithm involves the extension of methods for 
identification of stationary time series. There are several difficulties with currently available 
methods and software for the identification of system dynamics and noise characteristics. Current 
m ethods include the self tuning regulator (STR) (Ljung, 1983; Astrom, 1973; Astrom et al, 1973, 
1977), likelihood estimation (MLE) (Mehra and Tyler, 1973; Larimore, 1981a), the 

Bax-Jenkins (BJ) method (Bax and Jenkins, 1976), and a variety of heuristic approaches. The 
current state of the art in both MLE and BJ require that an analyst be involved in the procedure, 
and the required number of computational iterations is not bounded. The STR has been applied 
successfully to simple pro ce ss e s, but is not completely reliable for general p ro ces s es particularly 
when multi-input, multi-output systems are involved. In addition, the recursive prediction error 
algorithm used in the STR requires a good initial estimate and so is not suitable for short data 
where no apriori data is available. The heuristic approaches tend to be for special purposes and 
are rather unreliable in general applications. 

S.1 Models far Chan^ng Proems 

The problem of modeling changing processes involves primarily two types of changes 

• changes that are slow compared to the data interval used for identif¬ 
ication 

a abrupt changes occurring infrequently compared to the data interval 
used for identification. 

If the changing process changes too rapidly or the abrupt changes occur too frequently relative to 
the data interval required for sufficiently accurate identification, then it is not possible to 
separate the actual system changes from the variability due to sampling. 

Consider a time varying Markov process where the conditional probability of the future 
given the past depends upon time i so 


p{f,\p,,64) -p(f,\x„Bj) 

where the state, defined as linear combinations of the past p ,, varies with time as 

*i ■ c iPi 


(5-1) 


(5-2) 












I 
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la order to upwa the *»«w evolution of the state of a Markov process in terms of a system of 
state space equations of the form 

+ G,u, + w, (5-3) 

y, - H,x, + A,*, + B,w, + v, (5-*) 

where the various matrices are time varying, it is necessary and sufficient that the conditional dis¬ 
tribution of the future /, conditioned on the state x, have the form 

(5-5) 

This condition is essentially that the information in the state x, for predicting the future /, is con¬ 
tained in the state x,_ 1 delayed by one time step and the present inputs u, and outputs y,. This 
condition is satisfied by most physical systems since the memory is stored as energy in physically 
describabie states. If the system changes abruptly, there may also be an abrupt change in the 
input or a large noise innovation v, associated with a significant change in the state of the system. 
Formulated in this way, it is apparent how the state space modeling methods are particularly use¬ 
ful. 

In any modeling method based upon a finite sample of data, only a finite number of param¬ 
eters can be determined which are much fewer in number than the number of data. Of the vari¬ 
ous possible methods for modeling, the simplest and least presumptive is the piecewise constant 
model which is constant orer various intervals of data. Thus consider the model of the form (5-3) 
and (5-4) with piecewise constant parameters 9, 

*>+i - + G,u, + w, (5-6) 

y, ■ # A + A,m, + B,w, + v, (5-7) 

The coefficient matrices G t , H,, A i , and B t are functions of the parameters 9, which are con¬ 
stant over an interval of time T, and change from one time interval to another. 

In the following sections, adaptive time series analysts methods are developed by considering 
various hypotheses concerning slow and abrupt changes. The predictive inference and entropy 
methods provide a means of objectively comparing the vast multitude of such hypotheses entailed 
in the adaptive time series analysis problem. 









TIm probisa at » <4 t— to stow variations is primarily that of determining the length of 
time interval to uas in the time varying model (5-6) and (5-7). Consider the division of a section 
of data into 2* subintervals of length 2* samples were h and / are an integers. Then the various 
hypotheses can be considered such as H,: divide the interval into subintervals of length if. For 
each subinterval l )t for /«1,2,...,2\ suppose a Kate space model denoted J# y is fitted using the 
CVA method with A1C used to select the be* model state order. 

By successive applica t ion of the Markov property, the joint probability density of the obser¬ 
vations conditioned on the initial state is given by 

logp(F,. Y t \ Y 0 ,9 k ) - £ k V< r ;l («) 

y-i 

where - (•[.•£) is the parameter vector for the composite model consisting of all of the 

models over the 2* subintervals. This gives the log likelihood as the sum of the conditional log 
likelihoods on each subinterval. 

Consider the entropy measure of the composite model . Using the asymptotic approxima¬ 
tion (3-5), the negentropy is 

*(*.*) - ^ + R&&) » £[-^- + *(•.•')] («) 

2 )-1 2 

where 8* and are understood to denote the parameters constrained under the respective 
hypotheses involving estimation of these models. Note that there is a tradeoff between the fir* 
term which increases with finer subdivisions of the data and the second term which decreases as 
the number of parameters is increased with finer subdivisions of the data. A minimum of the 
negentropy defines the optimum subdivision of the data. 

To estimate the negentropy from the sample, the A1C is used. From the definition of AJC, 
the A1C co rresp on ding to (5-9) is given by 

AIC(Y u ....Yp&) • ^AlC(Yj,H) (5-10) 

/-i 

The optimal data length is chosen as the * minimising the above AJC. 

As an illustration of this scheme, it was applied to a problem in the 1978 Workshop on Spec¬ 
tral Estimation (Gerhardt, 1978) in estimating the instantaneous frequency of a sine wave with 
time varying frequency in the presence of interference and random noise. The data were 











Mil 


(SECONOS) 


Figure 2. Instantaneous Frequency of Signal (solid) and Interference (dashed). 

and where the now is uniformly distributed with -100<a(i)<100. The participants were told to 
e sti ma t e the instantaneous frequency of the signal which was observed in the presence of interfer¬ 
ence and noise. 

Table 2 gives the per sample AJC corresponding to the identified state space models for 
of the subintervals used which were of lengths 16, 32, 64, and 128 samples. Also given are 
wr sample AIC's of the composite piecewise constant models for samples of 128. It is seen 
the whinterval length producing the minimum average AJC for all of the dam is 32 samples, 
was than use as the optimal subinterval length for modeling the instantaneous frequency. 
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The of the instantaneous frequency wu chosen as the maximum of the spectrum 

obtained from the CVA model fitted to the data. The estimated instantaneous frequency is given 
in Table 3 along with the other three best solutions obtained by the other participants in the 


SAMPLE 

TIME 

TRUE 

WILEY AND 
CARMICHAEL 

WEINER 
ET AL 

ADAPTIVE 

CVA 

MAXIMUM 

ENTROPY 

32 

141.47 

141.47 

14109 

142.13 

141.7 

64 

145.73 

145.73 

14508 

145.13 

1462 

96 

15237 

15237 

152.69 

15222 

152.6 

128 

160.73 

160.73 

160.71 

16105 

1603 

160 

17000 

170.00 

17022 

16930 

169.9 

192 

17927 

17927 

179.48 

178.40 

179.9 

224 

187.63 

187.63 

187.72 

18806 

1882 

256 

19427 

19427 

19433 

193.90 

1940 

288 

198.53 

19833 

19703 

19722 

1980 

320 

20000 

200.00 

199.72 

200.94 

2013 

352 

198-53 

19833 

19709 

19835 

1983 

384 

19427 

19427 

193.93 

194.66 

1940 

416 

187.63 

187.63 

18709 

187.99 

187.4 

448 

17927 

17927 

17807 

17902 

178.9 

480 

17000 

170.00 

169.64 

17021 

1700 

512 

160.73 

160.73 

160.11 


164.0 


Table 3. Instantaneous Frequency Estimates. 

The best solution by Carmichael and Wiley (1978) uses a special zero crossing method that is 
applicable only to pure sine waves so that it will not generalize to more general spectra. 

The adaptive CVA method did about as well as the best of the methods other than Wiley 
and Carmichael, and much better than a lot of them. Note that the adaptive CVA approach 
makes no asmmptions about the form of the spectrum or the character of the time variation. 

Also the adaptive CVA method is completely automatic, and in this example did not involve any 
considerations by an analysis to determine the choke or modification of the computations. As ! 

measured in terms of the estimated instantaneous frequency, the method did very well. 

53 Adaptation la Abrapt Chanpm 

The primary problem in adaptive time series analysis is determining if and when an abrupt 
change has occurred. This problem reduces to comparing two intervals of data and determining 
if the same pro cess model is a better description of the observations than a different model for 
each interval of data. A complication of the problem is that the exact time is unknown and must ! 

be determined from the data. This involves the comparison of a multitude of hypotheses con- | 

cerning the pomibte time of occurrence of the abrupt change. In addition, the ability to detect j 

I 

3 

* 
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the abrupt change depends upon the data length of the data intervals used. The best data length 
for detection depends upon the type of abrupt change since some changes affect the observations 
immediately and the effect decreases rapidly, while for other changes the effect on the observa¬ 
tions takes some time before it is apparent. Thus the consideration of different data lengths 
requires additional hypotheses to be considered and compared. The predictive inference and 
entropy methods give a sound basis for the comparison of the multitude of nonnested hypotheses. 

Consider the problem of determining if there is a change in the the pro ce ss between two dis¬ 
joint data subintervals. The detection problem considered is where the process is modeled as a 
slowly changing process using some efficient procedure such as given in the previous section. The 
notation of the previous sections will be used with subscript 1 or 2 corresponding to the data 
intervals Y x from the past slowly changing models and Y 2 from the new date that is to be com¬ 
pared for detection of an abrupt change. The subscripted parameters 0 , and 02 with or without 
other superscripts, hats, or tildes will denote models based upon the corresponding data interval. 
The date lengths of the two intervals Y x and Y 2 are generally different with the first data interval 
determined by the slow adaptation method and with the second set usually much shorter and of 
variable date length since the best data length for detection of abrupt changes is not known. For 
any selection of the two intervals, we wish to determine if there has been a significant departure 
in the process characteristics between the two data sets. 

Ideally, the hypothesis ( 61 , 62 ) that the models are different over the two subintervals Y x and 
Y 2 verses the hypothesis 6 1 that the model is the same over the joint date set (Y x ,Yj) would be 
compared. But this would involve a considerable number of comparisons. To avoid such numerous 
comparisons, consider the following approximation. Let date set Kj be chosen as the most recent 
optimal length interval preceding Y 2 with corresponding model 0 j which provides a near optimal 
prior model Y x . To detect any abrupt changes in the system, consider the approximation of using 
the model 0 j as an approximation to the joint model 0 / . 

As discussed in Section 43, consideration can be limited to conditional models given the 
past or equivalently the initial state at the beginning of the subinterval. Using such conditional 
models, the likelihood function reduces to 

/»(T„K 2 .0 1 )-p(Ti,e,)p(T2.»i) (5-12) 

The model on the first data set Y x is the same for both the above hypothesis and the change 
hypothesis p(Y x ,9{p( The entropy measure is the difference of the above two log likeli¬ 
hoods which involves only the likelihoods on the second subinterval Y 2 so that 

*(Mi) - £(logpOW - Itvfyj.Sj)] (5-13) 













If the system in fact had an abrupt change between Y x and Y 2 , then since the data length Y x is 
much longer than Y 2 , moat of the information in detecting the abrupt change is in the comparison 
of the two models and hj on the data Y 2 . 

The problem is now to obtain an estimate of the entropy measure (4*13) from the observa¬ 
tional data. The observed log likelihood is used as an estimate of the entropy as in (3-7). The bias 
of this estimate of the entropy measure is 

E[t(Y0 - liYJ) - £[/(§) - /(Xj)] - E (/(0) - l(Y0] 

- -dimfe 2 ) + R(B,Y j) - R&.YJ (5-14) 

where the term dimtf 1 ) in (3-8) is not present since the estimate a 1 is a function only of the sam¬ 
ple Y x which is conditionally independent of the sample Y 2 . Thus an unbiased estimate of the 
difference of negentropies R(d,Y 2 ) - R(0,Y]) of the two models is 

l(Y0 - l(Yj) + dimtf) (5-15) 

This gives a test for the occurrence of an abrupt change between the two data intervals. 
Depending upon the nature of the change and the process characteristics, the best detection 
interval will vary. Some changes give most of the information about the change over a short 
interval while others have a cumulative effect and require a long time interval to detect. 

Consider as an example of the procedure for detecting abrupt changes the ARMA(43) 
model (4-8). Three types of abrupt changes were simulated including an abrupt change in the 
dynamics, in the state, and in the variance of the excitation noise w,. The results of the pro¬ 
cedure for detecting abrupt changes for the case of no change and cases of a simulated abrupt 
change in the dynamics, the excitation noise, and the state are shown in Tables 4, 5, 6, and 7 
respectively. In each case, the entropy measure for detecting the abrupt change was computed 
over various intervals of data of lengths 50, 100, and 200 samples and the abrupt change occurred 
at the saw*|4* time 32S. In the case of no abrupt change, the entropy measure shown in Table 4 
is on the average 0.5186 with a standard deviation of 0.016. As shown in Table 5, the largest 
value of the entropy measure is in fact in the interval samples 300-350 containing the time of the 
abrupt change in dynamics. The following interval of samples 350-400 also indicates a large value 
of the entropy measure. The initial large value in interval 300-350 is due to a transient in the 
state as it settles to a new steady state variance which is largely complete in interval 350400. The 
entropy measure then persists at this value in following intervals. The abrupt change in noise 
variance shown in Table 6 has a different character. The negentropy changes abruptly in interval 
300-350 and remains at that value in succeeding intervals. With an abrupt change in the state 
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SAMPLE 

TIMES 

NUMBER OF POINTS IN DATA SUBINTERVAL 

" 

100 

200 

2&) 

250 

300 

350 

400 




IflPl 

0303 

0318 

0.513 

0340 

0334 

0327 


Table 4. Value of Per Sample A1C for Subintervals of the Sample with No Abrupt Change. 


SAMPLE 

NUMBER OF POINTS IN DATA SUBINTERVAL 

TIMES 

50 

100 

200 

200 

?cn 

0.495 



LJU 

inn 

0313 



JsJU 

28220 



400 

11.628 







Table 5. Value of Per Sample AIC for Subintervals of the Sample with an Abrupt Change in 
Dynamics at Sample 325. 

shown in Table 7, the departure is largely confined to the interval 300-350 although the transient 
has not quite died out in the interval 350-400. 

In all cases there was no assumption as to the nature of the change, and the procedure 
works as well on state jumps, changes in noise variances, or other changes. Note that the charac¬ 
ter of abrupt change is quite different depending on the type of abrupt change that occurs. The 
best detection of abrupt changes can only be achieved by an adaptive procedure that considers 
the multitude of data intervals and selects a near optimal data length for detection. These initial 
results on the adaptive detection procedure demonstrate that it is very sensitive to a variety of 
different abrupt changes in the model. 
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Table 6. Value of Per Sample A1C for Subintervals of the Sample with an Abrupt Change in Ex¬ 
citation Noise Variance at Sample 325. 


r! 

SAMPLE 

NUMBER OF POINTS IN DATA SUBINTERVAL 


TIMES 

50 

100 

200 

1 

200 

250 

0.495 



| 

■wi 

0.513 



1 

J uu 

'lcn 

85.766 



1 

JJU 

0.595 



k 






Table 7. Value of Per Sample A1C for Subintervals of the Sample with an Abrupt Change in 
State at Sample 325. 

















6. SMALL SAMPLE MULTIVARIATE ANALYSIS 


The approach to small sample inference in this Chapter is the use of the entropy measure of 
model approximation error to evaluate the performance of small sample methods. This general 
approech is based on the justification of entropy as the natural measure of model approximation 
error as developed in Chapter 2. The historical Bayesian predictive inference approech plays a 
major role in providing a computable predictive density which is subsequently shown to be 
optimal in terms of the entropy measure. This optimality is established by considering best invari¬ 
ant predictive densities. For the multivariate normal family, several predictive densities are com¬ 
pared with the best invariant to show the large improvements that are possible in small samples. 


8.1 Baycrfao Predictive Inference 

The historical approach to predictive inference involves the use of Bayesian concepts and 
methods to determine the predictive density. Consider the parameterized class of probability den¬ 
sity functions 

F = {p(y,x| 8):0€0} (6-1) 

defined on the joint sample space (X,r). The predictive inference setup as in Section 2.1 is con¬ 
sidered where a predictive density p„(y| x) is to be constructed as an approximation to the true 
conditional density p(y|x,6) for the unkn own parameter value 6. In the Bayesian approach, the 
predictive density is constructed on the basis of an assumed prior density p(8) on the parameters 8 
using Bayes rule. 

From Bayes Theorem, the posterior density of the parameters is given by 

<«> 

where the marginal density of x is 

p(*)-/p(e)p(*|eye («) 

The Bayesian predictive density p*(y| x) is then given by computing the marginal density using 
the posterior so 

P*(ylx) -/p(y|x,e*(e|x)de M 


This approach is direct and simple although the assumption of a prior density p(8) on the 
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parameters 9 is bothersome from both a theoretical and practical point of view. 

A major objection to the Bayesian approach is the use of an arbitrary prior density on the 
parameters to express ignorance. If a uniform density is used for the prior on 6, then a nonsingu¬ 
lar tr ansf ormation of the parameters to a new set ^ - g( 0) and use of a uniform prior on 4 in 
general produces a different posterior density. Thus there is a certain arbitrary choice of the 
parameterization and resulting posterior. A way around this is the use of nonurformativ* prior 
densities. Such densities give posterior densities that are invariant to transformations of the 
parameter space (Jeffreys, 1961; Box and Tiao, 1973). In situations where a noninformative prior 
exists, it can be obtained in terms of the Fisher Information matrix. 

In recent years, some intriguing connections between the bayesian predictive density and 
concepts of entropy, frequentist methods, invariant methods, and noninformative priors have 
been made. Still in a strictly Bayesian context, consider the negentropy measure (2-6) applied to 
the Bayesian predictive density (64). Of course a Bayesian would take expectation of this meas¬ 
ure with respect to the unknown parameters 6 which will be called the Bayts Bisk. Using (64) and 
interchanging the order of integration, the Bayes Risk between any two predictive densities 
Pi(y| x) andp 2 (yj x) is 

£#£ii•fyj.sfriO'l *);P2OI *)) * Jp(®)Jp(*I ®)/pOH *,») *y 

= fp(*)dxjpb(y\ *) lo *^jry dy (*■*) 


Now setting pj(y|x) ■ p b (y\ x) guarantees that the Bayes Risk between p b and any predictive 
density p 2 is nonnegative and zero if and only if p 2 (y | x ) - p b (y\ x). 


Thus in a Bayesian context, the Bayesian predictive density is optimal in terms of the Bayes 
Risk, i jt. the expected negative entropy measure with the expectation also taken over the param¬ 
eters 0. As was noted by Aitchison in the original derivation of this result for the case of x and y 
independent, there are a number of interesting cases where the negative entropy measure, i.e. the 
Bayes Risk excluding expectation over the parameters 9, is not a function of the parameters 9. In 
such cases the Bayesian predictive density is optimal in a frequency sampling sense where there is 
a fixed unknown true parameter value and the negative entropy (2-6) is used as the measure of 
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62 9m Iavariaat Pndfctfve Dmritke 

In several particular cues, the optimal predictive density has been found that minimis the 
negative entropy. Murray (1977) considers the claa of d-dimensional multivariate normal densi¬ 
ties N 4 (h2) with 

p(y\ *2) - (2ir)^| 21 - W exp{-|(y-^r , (y-^)} («) 

In this case Aitchison and Dunsmore (1973, p. 29) show that using the noninformative prior den¬ 
sity proportional to | £| the Bayesian predictive density is the d-dimensional Student distribu¬ 
tion 

P,(y\ *) - St 4 [n-l,iL,(* +lX«-l)' 1 i] (6-7) 

where the d-dimensional vector z is St 4 {kjb^) if it hu density function 

, (f) _ _ (M) 

This Bayes predictive density was shown (Aitchison, 1975) to result in the negative entropy not a 
function of the unknown parameter 0.. It is thus also optimal in the frequency sense. 

This same result wu derived by Murray (1977) using invariance concepts. Consider the class 
G of invariant predictive densities p a (y| z) that are invariant to translations and linear transforma¬ 
tions of the sample x. In this class, the best invariant predictive density p t (y\ x) was shown to be 
(6-7) which gives a constant value of the negative entropy independent of the value of the true 
parameter value 8.. This gives a strictly frequentist interpretation of the Bayes predictive density. 
A stronger result reported very recently (Levy and Pemg, 1986) is the minimali ty the negentropy 
for the best invariant predictive density p ; (y) z) uniformly in the unknown parameters 
8. =* (ibL) among any predictive density in the class G of invariant predictive densities. 

63 Compart— of Entropy for Multivariate Normal 

To illustrate the usefulness of the predictive inference approach using negentropy, the 
results for the multivariate normal distribution are given below in terms of the relative odds of 
the likelihood ratio. Consider the case (6-6) of the multivariate normal density Here 

three methods are compared: the estimative method using the predictive density 
Pe(y\x) ~N 4 (y,Mt)JUx)) where the maximum likelihood estimates £(x) and £(x) are used, the 
best normal with estimates (£,(«+ 1X"-*-2)‘ l i) which minimize the negentropy in the class of 
normal densities, and the Bayesian predictive density which is identical to the best invariant 










The expected negentropies are shown in Table 8 for the above three predictive densities. 


Predictive 

Number of Observations (n) 

Densities 

r« 

6 

11 

14 

20 

50 

1-Dimensional 

Estimative 

1.191 

(2-48) 

0366 

(1.170) 

0.130 

(1.037) 

0.094 

(1.121) 

0.060 

(1.009) 

0.021 

(1.001) 

Best 

Normal 

0.476 

(121) 

0226 

(1.048) 

0.103 

(1.009) 

0.079 

(1.006) 

0.053 

(1.002) 

0020 

(1.000) 

Best 

Invariant 

0282 

0.180 

0.094 

0.083 

0.051 

0.020 

8-Dimensional 

Estimative 

* 

• 

3687 

() 

8.08 

(221.4) 

285 

(3819) 

0.61 

(1.127) 

Best 

Normal 

- 

- 

6.79 

(8.93) 

3.15 

d-56) 

1.63 

(1127) 

030 

(1.010) 

Best 

Invariant 

- 

- 

4.60 

2.69 

131 

0.49 


Table 8. Expected Negative Entropy (and the Geometric Mean of the Likelihood Odds Relative 
to the Best Invariant). 

In compering two predictive distributions, the relevant quantity is the difference between their 
negentropies (2-5) (Larimore (1983a)). The exponential of this quantity is the geometric mean of 
the relative odds of a sample y having come from the two respective predictive distributions. 
This exponential of the negentropy difference is also given in parentheses in Table 8 for the best 


















normal and estimative methods relative to the best invariant method. Since exp{a-b) = l+( a -6) 
for o-h«l, we see that for negentropy differences much less than unity, the odds of an 
observed d -dimensional sample y coming from either of two predictive distributions is about 
equal. For the negentropy difference near unity, these odds are disproportionate of order e =2.7; 
and if it is much greater than unit the odds can get very large. Note that a 20 percent increase in 
the negentropy as between the estimative and best invariant for d =1 and n =20 has only a one 
percent odds advantage. On the other hand, a 17 percent increase in the negentropy as between 
the best normal and best invariant for d =8 and n =14 has an odds ratio of 136. This emphasizes 
the importance of comparing the negentropy on the basis of the arithmetic difference and not the 
relative proportion. Note that for very small samples the relative odds ratio can be much larger 
than unity and even in the tens or hundreds. Thus there is a huge potential gain in the use of 
predictive inference in very small samples as has been noted in different terms by Aitchison and 
Dunsmore (1975, p. 231), Aitchison and Kay (1975) and Murray (1979). 


I 









7. CONCLUSIONS AND RECOMMENDATIONS 


Id Cowctnd—s From Phase I Study 

la this Phase I SBIR study, statistical methods are developed using predictive inference and 
entropy. This approach has a strong intuitive appeal as a result of the justification of the entropy 
measure based upon the predictive inference framework and the fundamental statistical principles 
of sufficiency and repeated sampling. This approach applies to a wide class of inference problems 
including: 

• general inference methods such as parametric or nonperametric 
methods 

• exact evaluation of small sample procedures 

• determination of model order or structure including the case of non¬ 
nested multiple comparison 

• time series analysis including definition of optimal tracking of time 
varying processes and optimal detection of abrupt changes. 

The entropy measure provides a fundamental measure for the comparison of alternative statistical 
procedures and provides a basis for developing optimal statistical inference methods. 

In this study a number of particular topics were addressed from the predictive inference and 
entropy perspective including: 

• statistical model building involving the determination of parametric 
model structure and order in the general case of constrained multi¬ 
ple nonnested alternatives, 

• time series modeling and forecasting involving the determination of 
parametric model structure and order, 

• adaptive time series analysis involving optimal methods for tracking 
slow changes as well as for detecting abrupt changes or failures, 

• small sample inference for multivariate distributions of the exponen¬ 
tial family. 

Some major results were developed on these topics that demonstrate the feasibility and desirabil¬ 
ity of developing statistical methods using predictive inference and entropy. 

A number of results were obtained for the nonnested multiple comparison problem baaed 
upon the study of constrained maximum likelihood estimates. These include: 
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| • cooftderatkn of the general constrained cue 

• general extension of Akaike'i A 1C procedure to constrained non- 
nesed multiple comparison problems 

e solution of the general constrained cue requires that a condition on 
the Fisher information and Hesrian matrices be satisfied 

| e a general model order and structure selection method was shown to 

j be asymptotically optimal. 

! Previous developments consider only the case where the true parameter is approached asymptoti¬ 

cally and exclude the case where the true parameter lies outside the models considered. The con¬ 
strained case investigated in this study gives a basis for viewing the predictive inference and 
entropy method u model approximation when the models are restricted and asymptotically 
biased. These results provide a basis for the use of predictive inference and entropy on the gen¬ 
eral time series analysis and adaptive time series analysis problems involving constrained non- 
* nested multiple comparison. 

! Using currently available methods, the time series analysis problem is difficult because the 

i parametric model structure is unknown and requires the fitting and comparison of many different 

| models. Also currant methods are numerically and statistically illcoaditioaed for some models. 

I The approach of predictive inference and entropy provides a natural solution to the multiple com¬ 

parison problem. The results obtained using the predictive inference and entropy approach for 
multivariate time series analysis include: 

• Explicit expressions for a lower bound on the achievable accuracy in 
J the estimation of the transfer function and power spectral matrix 

i 

j a This lower bound applies to the case where the true model order is 

l unknown and a model order determination procedure is used. 


• The lower bound is achieved for large samples using maximum likel¬ 
ihood estimation and the AIC order determination procedure. 

An example of the estimation accuracy of a true ARMA(44) process using spectral smoothing, 
AR model, and ARMA model fitting show the considerable difference in using these various 
methods. 

| Markov models of time series were developed as a basis for stable time series analysis 

lj methods using the canonical variate method (CVA). This method is numerically and statistically 

t stable and has been applied recently to a number of high order multivariable time series analysis 

I problems. This approach provides the basis for adaptive time series analysis methods. 

: Markov models of time series with changing characteristics were developed including slowly 

jj and abruptly changing processes. Using the CVA method as the computational method, the 

| entropy methods were applied to adaptive time series analysis: 

5 








• Sm arted methods for determining the optimal data length for adap¬ 
tation to slow changes were developed. 

• a — *— methods for choosing the optimal time interval for detec* 
tim of an abrupt change in the proc e ss were developed. 

• The entropy measure is optimally sensitive to any abrupt changes 
including changes in process dynamics, changes in the excitation 
noise levels, and jumps in the process state. 

These results demonstrate the feasibility of developing adaptive time series analytis methods 
based upon entropy methods and the CVA computations. The CVA computations have been 
demonstrated in real time identification of multivariable systems. Thus the feasibility of adaptive 
time series analysis in real time has been demonstrated. 

Small sample methods were developed using the predictive inference and entropy methods. 
The justification of entropy based upon the sufficiency and repeated sampling principles provides 
a sound justification for the use of recently developed small sample methods based upon entropy. 
The Bayesian method was extended to the case where the informative and predictive experiments 
are dependent. The theory is illustrated for the multivariate normal distribution using the Baye¬ 
sian, best invariant, estimative, and best normal predictive densities. The relative measure of 
approximation is shown to be the per sample relative odds ratio which is the exponential of the 
entropy measure. 


7 2 Racansnssndatia— for Further Research and D e velopme nt 

This study has demonstrated the feasibility and usefulness of predictive inference and 
entropy methods particularly in the areas of: 

• constrained nonnested multiple comparison of models 

• model order and structure determination for time series 

• modeling of changing processes using Markov model structures 

• optimal adaptation to slowly varying pro ce ss es by optimal selection 
of data interval 

• optimal adaptation to abrupt changes of unknown type at unknown 
times by optimal selection of the detection data interval 

• automatic stable computation of time series models using the CVA 
method 

• determination of lower bounds for the estimation of transfer func¬ 
tions and power spectra 

These achievements provide a bases for the further research and development of predictive infer¬ 
ence and entropy methods. 







The inn at greatest promise appear to be thoee of adaptive end nooadaptive time aeries 
analysis for the following reasons: 

• the number of potential applications to DoD systems is very large 

• time series analysis and adaptation are the major problems in adap¬ 
tive control 

• to address the adaptive time series analysis problem requires an 
approach that deals with the multiple comparison problem in a fun* 
damental way that is offered by predictive inference and entropy 
methods 

• among the current time series analysis methods, only the CVA 
method is suitable for real time solution of the problem 

• present and near future computers are capable of multivariable iden¬ 
tification in less than a second of computation for high order systems 
of dozens of states 

These methods have been demonstrated to be feasible, and the development of online adaptive 
time series analysis software for general application would provide an enormous capability for 
DoD systems. Presently there are no other known approaches that will achieve this goal. Such 
adaptive time series proc e ssors would allow for the adaptation of systems to slow and abrupt 
changes in the environment. 

The topics recommended for further research and development include: 

• further research and development on the adaptive time series 
analysis methods for adaptation to slow and abrupt changes 

• development of algorithms for implementation of the adaptive 
methods that are numerically stable and accurate and are statisti¬ 
cally reliable 

• prototype algorithm testing to demonstrate the accuracy, reliability, 
and computational requirements on typical DoD problems. 

• software development to provide modular, documented, and verified 
software in one or more general programming languages. 

The achievement of these objectives would provide a dramatic improvement in the performance 
of adaptive methods and the availability of software for adaptation in DoD systems. 
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The general problem of choosing a model from among a multitude of alternative 
models remains one of the difficult problem of statistical inference. Traditional methods 
of statistical hypothesis testing apply directly only to the case where there are two 
hypotheses under consideration and one is a subset of the other, i.e., the two hypotheses 
are nested. In such cases, classical methods are applicable and lead to well understood 
results. In the case were there are more than two hypotheses involved, the use of the 
classical methods are not well defined or understood even in the nested case (see for 
example the discussion in Anderson, 1971, pp. 270). Although the probability of rejecting 
one hypothesis in comparing any pair is well defined, the repeated application of such 
pairwise comparisons results in a test whose properties are not understood. In the case of 
comparing two hypotheses that are not nested, the distribution theory is available but 
much more complicated (Larimore, 1977). 

Beyond these difficulties in carrying out the classical procedures is the lack of a gen¬ 
eral framework for formulating and solving the problem of nonnested multiple comparison 
of constrained models. The predictive inference approach offers a predictive measure of 
the accuracy of various model selection procedures that apply as easily to the case of non¬ 
nested multiple comparison. The adequacy of a model selection procedure is measured in 
terms of the accuracy of the selected models in hypothetical repeated 'future* experi¬ 
ments. This is very attractive in the context of scientific inference were the role of model 
building is to provide a basis for prediction of the future behavior of a phenomenon. The 
entropy measure provides a most sensitive measure based upon the sufficient statistic as 
contained in the likelihood ratio. The derivation of the entropy as a measure of the pred¬ 
iction error of a predictive density in the predictive inference framework is based upon 
the fundamental principles of sufficiency and repeated sampling (Larimore, 1983). This 
provides a strong theoretical basis for the use of entropy in predictive settings of scientific 
inference. In more narrowly defined problems of quality control or decision theory 
involving a well defined lorn function, other procedures may be more appropriate. But in 
a predictive scientific setting, the approach using predictive inference and entropy seems 
much more justified. 

Consider the problem of choosing among a multitude of model structures on the 
besis of a set of observations If we adopt the predictive criterion that the chosen model 
should be the best in a predictive sense in predicting another independent sample from 
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the «««"* process, then the optimal choice is the model selection procedure with the 
minimum aeg eatf opy. The major problem is the practical evaluation of the negentropy 
measure and the determination of efficient procedures that come close to minimising the 
negentropy measure. 

In this paper, the theory of inference for nonnested multiple comparison is 
developed in the context of predictive inference and entropy. To develop a general 
theory, the case of maximum likelihood is considered for moderate and large samples. 
The case where the true process model is not contained in the models considered for 
inference is the usual situation in scientific inference since even the most general model 
forms usually do not include certain complexities such as nonlinearities, nonstationarity, 
etc., that may have a small effect or be very difficult to handle. Previous approaches 
involving the entropy measure have not explicit included this miss-modeling. It is shown 
that this miss-modeling can be directly considered in the analysis. The resulting theory is 
very attractive in that is gives an explicit interpretation of the predictive inference 
approach as model approximation of the true process using simplified alternative model 
forms. The entropy methods lead to procedures that select models that in the predictive 
sense are the most accurate in approximating the true process model. The classical diffi¬ 
culties of nonnested and multiple comparison do not arise in this predictive inference set¬ 
ting. 

In the paper, first the subject of constrained maximum likelihood estimation is 
developed in the predictive inference and entropy context. This is used to derive the 
expected negentropy for maximum likelihood estimates, and then to determine an 
unbiased estimate of the entropy. Finally, bounds on the achievable accuracy of model 
selection procedures is derived that depend on the number of estimated parameters in the 
model fitting. 

2. CatraiMd Msilmem Likelihood Fadmatfon 

In this section, properties of the maximum likelihood parameter estimates are 
developed for the case that the true probability model is not contained in the class of 
parameterized densities that are considered for inference. The classical development of 
the asymptotic consistency and minimum variance of maximum likelihood estimators is 
for the case where the true denary is contained in the parametric class. 
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The predictive inference framework as in Larimore (1983) is adopted here with 
p(x,9) the parameterized probability density where 6 is a vector of parameters, x is the 
informative sample and y is the predictive sample. Suppose that the parameter vector 
•••) >a a finite or infinite set of parameters, and for each subset of distinct post* 
tive integers *=(*,,... ,k m ) consider the subspace 0 4 of 0 such that only the correspond* 
inf e t| ,... ,0*_ are nonzero where 0* denotes a member of 6*. and let C t be the the 
class of models C t =~[p(x ,&*),&* (S t }. These classes of models are in general nonnested so 
that we do not in general have C t CCj or C, CC*. The maximum likelihood estimator for 
the class C k will be denoted as 0*(x). 

The development of the maximum likelihood theory is straight forward for the case 
where Taylor series expansions are possible. This holds under the following regularity 
conditions (Cox and Hinkley, p. 281): 

(i) The parameter space is closed and compact. 

(ii) The probability distributions defined by any two different values of 6 are distinct. 

(iii) The first three derivatives of the log likelihood /(x,0) with respect to 0 exists in the 
neighborhood of the true parameter value almost surely. Further, in such a neigh¬ 
borhood, n -1 times the absolute value of the third derivative is bounded above by a 
function of x, whose expectation exists. 

In particular, these conditions permit the interchange of expectation and differentiation 
up to second order. 

In the discussion various order models are considered, and the relationships 
between the various orders is developed. The log likelihood function of the informative 
sample x will be denoted by l(x,9), and the gradient row vector and Hessian matrix 
denoted /’(x,0) and /"(x,0) respectively. Expectation , denoted E, will be with respect to 
the true density p(x,9) unless stated otherwise where 0 denotes the true parameter value. 
Define the projection & of 9 onto S t as the parameters 0* €0* minimizing the negentropy 
R m relative to the informative sample x 

R, (0,0*) =* El (x ,0) - El (x ,0*) (2-1) 

At the minimum 0*, the gradient of (2-1) is zero so from the regularity conditions 


Er(x,0*) -0, 


(2-2) 








A-4 

and the minimum is unique if and only if the expected Hessian, denoted £)*, of (2*1) is 
positive definite in an open neighborhood of the minimum. From the regularity condi¬ 
tions, the Hessian is given by D*=£/"(x,8*). 

To determine the moments of the maximum likelihood estimates 0*. consider the 
first order equality 

0 = /•(*,§*) = /'(x.e*) + (0*-0*/r(x,0*) (2-3) 

Taking expectation with respect to the the true density and using (2-2) gives the equation 
D*(£ 0* -0*) =0 (2-4) 

that holds asymptotically for large informative sample N. For 0* identifiable, i.e. 0* 

unique, D* is nonsingular which implies that to first order 

£0* = 0* (2-5) 

Now using (2-3), the covariance of the estimation error is 

=(D 1 *)" 1 £{/’ 7 (x,0*)/(x, 0*)KD j *)- 1 (2-6) 

Note that in the unconstrained case, the middle term which is the Fisher information 
matrix is equal to minus the expected Hessian D*, but this is not in general true for the 
constrained case. 

3. Expected Negative Entropy for Maximum Likelihood Estimates 

To evaluate the expected negative entropy for maximum likelihood estimates, con¬ 
sider the predictive inference setup as in Larimore (1983). The general case of depen¬ 
dence between the informative and predictive samples is considered. The expected nega¬ 
tive entropy is a measure of the degree of approximation of the true conditional density 
p(y\ xfi) by the predictive density p{y\ x,0*) for predicting the future predictive sample y 
from the informative sample x. The expectation will be taken in two steps, first with 
respect to the random variable y| x and then with respect to x. In this section the likeli¬ 
hood function /(•*) - l(y\ x,0*(x)) is for the predictive sample y|x, and the maximum 
likelihood estimator 0*(x) is on the informative sample x. 

Expanding (2-1) in a Taylor series gives a second order expression for the informa¬ 
tion distance which holds asymptotically for large sample size of the informative sample, 
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i.e. for the maximum likelihood estimate 0* close to the projection 0* 

•*(*)) * E [/(•*) - /<•*)] + E[IW - /(§*)] 

= - o*)l - - o*)] + £:[/(©) - /(6*)] 

- -j£j &(x)-6 k || *, + *,,,(0,0*) (3-1) 

since 0*(x) is independent of /(y| x,0*) and using the gradient property of the projection 
(2-2). The second order expansion is only locally in the estimation error 0*-4* about the 
projection 0* of the true parameter value § on the subspace of the parameters correspond¬ 
ing to the model 0*. Note that this expression gives the exact bias term *,,,(0,0*) in small 
samples and involves no approximation. 

4. Unbiased Estimation of Entropy 

For decision on model parametric order and structure, it is necessary to estimate 
the negative entropy based on the informative sample. One such procedure is due to 
Akaike (1973). We consider the case where the informative sample x and the predictive 


sample y are independent. For each selection of a parameter subset k={k j. k m ), the 

Akaike information criterion for comparing the maximum likelihood estimators is 

AlC{k) = -Hog, (x ,0* (x)) + 2 K(k) (4-1) 


where K(k) is the number of parameters, i.e. the dimension of 0* . The minimum AJC 
estimator (MA1CE), denoted 0 A (x), is 0 4 (x)=*0* (l) (x) where k (x) is the parameter set 
minimizing AlC(k). The AIC(k ) is an unbiased estimator of the negative entropy based 
upon the informative sample and the assumed model structure. The predictive sample is 
essentially replaced by the informative sample, and the term 2 K(k) is an adjustment for 
the bias due to the correlation between the informative sample x and the estimate 0*(x). 

Following Akaike, we use the maximized log likelihood /,(0*) -/(x,0*(x)) as an 
estimate of the relative entropy and compute the bias in the procedure. We expand the 
log likelihood function as in (3-1) except that below the likelihood is on the informative 
sample so that there is dependence between / (x ,0*) and 0*(x). In particular, using (2-3) 
gives 
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- £[/(9*Xe* - 9*)] -£[(•* - ®*) r n®*X®‘ - •*)] (4-2) 

Consider expending the expected log likelihood difference as in (3-1) but using (4-2) as a 
result of the dependence. Thus 

E\m - /(•*)] - £[/(•*) - /(«*)! + £[/(•) - /(§*)] 

- -£[/(«*X9* -9*)]- £[{(»* - Vfi ”(®‘x®* - 9*)] + £[/(9) - /(©*)] 

- £((e* - e*//"(8*Xe* - o*)] + £(0,0*) 

- -tr (D*T X E {/ T (x ,0*)/ (x ,0*)} + £ (0,0*) (4-3) 

where the third equality follows using the expression (3-1) for the expected negentropy. 
In the general case, the bias term in the negentropy, £(0,0*), is correctly estimated except 
for the trace term. Asymptotically, if 0* approaches 0, then the matrix of the trace is the 
identity. This is the case considered by Akaike (1973). In the general constrained case, 
this may not be the case so that the bias depends upon the unknown true parameter value. 
What is required is a restriction such as the Hessian and Fisher information matrix being 
equal globally which gives 

-*r(D*r , £{/T(x,0*y (x,0*)} = trl K{k) = K(k) (<W) 

Consider the case of fitting two models 0* and &, and consider the expected differ¬ 
ence at the maximized log likelihoods 

£{/(0») - !&)) - £(/(9) - l(V )] - £[/(§) - /(§*)] 

=* + di/n(0*)-dim(0') + £(M0 - £(0,0*) (4-5) 

Thus for relative comparisons among hypotheses based on a given sample, an unbiased 
estimate of twice the negentropy £[/(0) - /(0*)] is given by the Akaike information cri¬ 
terion. Note that the proof of this is much more general than that originally given by 
Akaike (1973) since it applies to the general case of comparisons of nonnested structures. 
Also, the true parameter 0 need not be contained in the structures being compared so 
long as the Fisher information matrix is a constant in a neighborhood including the true 
parameter and its projection onto the subspaces of these structures. 
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5.- - the Accuracy of Model Sdccdoo 

To obtain a correction for the bias in the sample log likelihood as an estimate of the 
entropy measure, the Fisher information must be constant or other restrictions are 
required. In this section the Fisher information is assumed to be constant in a neighbor¬ 
hood of the true parameter 0 containing the projection 6*. Under these conditions a 
lower bound on the entropy measure is derived. From the above relationships and stan¬ 
dard arguments of asymptotic consistency, it follows (Cox and Hinkley, 1974, p. 292) that 
the constrained maximum likelihood estimate 0* is consistent and the limit in probability 
is 0*. In addition the properties of the unconstrained maximum likelihood translate to the 
projection 0* since the Fisher information matrix is constant. 

Consider first the case of estimating the model 0 using the *-th order maximum 
likelihood estimator 0*. Then from (3-1) 

*(0,0*) * ~2 tr £>*£{(e*-«*X0*-0*) 7 '} + *(0,0*) * ^ + *(§,0*) (5-1) 

where the inequality follows from the Cramer-Rao lower bound 
£ x [(0*-0*) r (-D*X®*-^*)] a trt K{ki N = K(k)/N where K(k) is the number of parameters 
estimated in the model 0* and N is the sample size. The last term in (5-1) is the bias in 
using too low an order in the model fitting, and the first term is the sampling variability 
apart from the bias. This bound K(k)/2S on the variability is achieved for an asymptoti¬ 
cally unbiased and efficient estimator for the class C k such as maximum likelihood. In 
particular, if the true model order is no greater than k , then 

Lim /V[*(0,0*) - *(0,0*)] ^ ~ (5-2) 

The true order k of the process is usually not known and may in fact be infinite, and the 
bias term in (5-1) is not known so that the above discussion is not very useful in practice 
although it does give some insight into the accuracy issue. 

Consider now the Akaike MA1CE procedure using the estimator 0 A . Assume that 
the true model order is infinite, so that for any j there exists a jstj such that OyX). 
Define the optimal predictive order t'(-V) depending upon the sample size N as the order 
k minimising the negentropy (5-1). Then under suitable assumptions, the remarkable 
result is obtained by Shibata (1981a, 1981b, 1983) that asymptotically asN-« 
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(i) the lower bound using any order selection scheme is for each N given by evaluating 
(5-1) at k «**(jV), and 

(ii) this lower bound is achieved by the MAICE estimator d A . 

Thus the lower bound on the aegentropy which is achieved by MAICE is equal to the 

negentropy that would result from using an efficient estimator with apriori knowledge of 

the optimal order k*(N). 
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1. INTRODUCTION 


The problem of determining the achievable accuracy in identifying a model for a station¬ 
ary multiple time series is considered in this paper. The cases of the presence or absense of an 
exogeneous input or additive measurement noise are included. Consider the general case where 
x(r) is the exogeneous input and y(t) is the observed endogeneous output of a system which may 
include other unknown excitations and measurment noise. Thus consider the jointly stationary 
gaussian vector time series x(r) and y(t), r =...,-1,0,1,..., with power cross-spectral matricies 
$„(<«»,0), Sjyftit.d), S w (<i>,0) parameterized by 0, and denote the power cross-spectral matrix of the 
joint vector (x r (r), y T (t)f as S(ui,0). 

Statistical inference is considered on a class of linear Gaussian processes parameterized by 
0. Specifying a parametric model for the conditional process y(t),tses given x(r), t<s implies a 
causal linear model of the form 

y(0 = ?(0 + 2 *(*-t;®)*(0 = ?(0 + r(t) (l) 

r»O 

where h(rft) is a causal linear system giving the response r(r) in y(t) due to the past exogeneous 
input x(r) and where q(t) is the error in predicting y(i) by r(t) From linear prediction theory 
(Gikhman and Skorokhod, 1969), the transfer function of h(t ;0) is //(<*> ;0) =S^ (u> ,Q)S^(u> ,0), and 
the error q(t) in predicting y(t) is uncorrelated with r(t) with power spectrum S w (<i>;0)=S w (<i>,tt) 
- H (w.O)^(w,0)W*(<*>,0)- Note that any class of parameterized models S(<*>,0) can be equivalently 
specified by the parameterized models (S w (w,0), H (<u,0)) which will prove more convienent. 

2. ENTROPY AND SPECTRAL ACCURACY 

Consider the following predictive inference setting (Larimore, 1983) involving an observed 
informative sample u T =(x T (l)j T (l),... r z r (N)j T (tf)) of size N used to estimate the process model, 
and similarly consider a conceptual predictive sample v of size M used to evaluate the accuracy of 
the estimated model. The predictive sample is assumed to be identically distributed but indepen¬ 
dent of the informative sample. Consider the problem of inference on the parametric class 
{p(v,0),0€0} of models with probability densities p(v ,0) based upon the informative sample u . 
Consider the conceptual repeated sampling experiment where on each trial the samples u and v 
are each drawn independently from the process 5(u,0.) with 0. assumed to be the true parameter 
value. An estimative model p=p(v ,0(u)) is chosen for the density of v by some parameter estima¬ 
tion scheme 0(u) . The negative entropy, also known as the expected Kullback-Leibler discrimina¬ 
tion information or expected I-divergence, is a measure the error in approximating the true 
density of v by the estimate p and is given by 
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R(p.j) = E u K(p.j>) = E„ f p(v, 9.) log-^ V . e,) ^ dv (2) 

p(v,e(«)) 

where £ a denotes expectation relative to u and K denotes the Kuilback-Leibler information. The 
negative entropy measure follows as the natural measure in the predictive inference setting from 
the fundamental principles of sufficiency and repeated sampling (Larimore, 1983). This 
approach applies to very general modeling methods such as non parametric, semiparametric or 
parametric procedures as well as methods including decisions on model structure or order such as 
those used for AR and ARMA modeling. 

Let lower case variables denote a sample of size M of the predictive sample, eg. 
y=(y r (l) 1 y r (2),...,y r (A/)) r and denote the covariance matrix of y. By expressing the density 
p(yj fi) ■ P(y~r fi)p(x ft) in terms of the conditional random process q(t) = y(t)-r(t), 

=p(y|x,0)p(x,0) =p(y - r(x ,0),0) p(x ,0) = p(q\ r(x,0),0) p(x ,0) (3) 

the log likelihood separates with the density of x(t) in many problems not a function of the unk¬ 
nown parameters or at least a function of a separate set of parameters. A conditional viewpoint 
is taken in the following where only the conditional term p(q\ r(x,0),0) with x considered as non- 
random is considered. The depencence of r on x will be understood in the notation. Inclusion of 
the second term is tantamount to modeling the joint vector time series involving the two series 
x(r) and y(t) jointly rather than as exogeneous and endogeneous respectively. The joint case is 
included as a special case of y*(t) = (y T (t)pc T (t)) a vector process with no input x(r) which will 
be discussed as a particular instance of the model throughout the paper. The I-divergence (2) con¬ 
ditional on x thus becomes 

. p(q\x,9.) 

K(p.j) = J pfa|x,0.) log———7- dq = E\o%p{y-r.X ) - E\o&(y-i£ ) 

P(<l\ x ,0) " 

” Elogp(y-r,£ n ) - Elogpify-r.^r.-r)^) 

- Elogpto-r.J") - £logp((y-r.)i„ ) - E(r.-rft^(r.-f) (4) 

where t »r(x,0), r. ” r(x ,0.), and where E denotes expectation with respect to the density 
P(y| *,».). 

For brevity set 5 denote the true spectrum, and let S denote an estimate of S. We will 
need to aanime that S(<a) is continuous and that S w (w) and J w (w) are positive definite for 
. In the discusaon, the predictive sample v will be considered to be conditional on x(i) 
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and to have an infinite sample size M. This will require the normalization of the negative 

entropy and l-divergence by the sample size M The I-divergence per sample time conditional on 

x(r), which will be denoted l(S J) and called l-divergenet for brevity, can be expressed using (4) 
as (Kazakos and Papantoni-Kazakoa, 1980) 

l(Sj) - 

- - \ fUoti + tr[l - S^MV'WJ^ 

j ftr{S„'lH(i.) - //(•)£„ Mf*M - "(“)D-£f (5) 

where the subscript emphasuet that the sample of size M of v becomes infinite. The negative 
entropy per sample, or negearropy for brevity, is defined as N(S J)=lim-jjR(p.j}) =E„I(Sj). 

Note that the (-divergence is composed of two terms, the last due to the error in estimating the 
transfer function //(«) and the first due to the error in estimating the spectrum J w («n) of the 
noise q(<). A useful approximation for the first term in (5) is 

-*«<•>»*£ <«> 

which holds to second order in the elements of 5 w as is easily shown by comparing first and 
second derivatives of the integrands. This is a generalization to the multivariate case of the 
integral of the squared relative error. Thus the I-divergence is approximately a quadratic form in 
the estimation errors of J W M and H(<a), and these quadratic forms do not interact, in. there 
are no crom terms. 

3. NORMALIZED SPECTRAL ERROR IN PRINCIPAL COMPONENTS 

In the multpie time series case, the spectral measure (5) has an intuitive interpretation in 
terms of principal components of the power spectrum in the frequency domain. Principal com¬ 
ponent representations of the spectral tnatricies S W M and S a (<•») have the form 

J(») - D(«) , LM s 0 M L*M - EM 


(7) 



where J(m ) end L(«) given as a function of frequency <o are unitary matrix transformations which 
diagonal^ 5 S («) and S w («) respectively so that y(u>)/*(<») ■/ = f. (<*>)£ *(<*>). end where D(u) 
and £(w) are diagonal matriciei. 

Using spectral factorization theory, the matrix function /(<*») can be chosen as a continu¬ 
ous function of w and the transfer function of a causal filter under either of the mild assumptions 

(i) The process is purely non deterministic ((Gikhman and Skorokhod (1969), Whittle(19S4) for 
the scalar random field case). 

(ii) The autocovariance function is absolute summabie (Goodman and Ekstrom (1980) for the 
scalar random field case). 

These derivations of the spectral factorization for the scalar case generalize to the multivariable 
case with care given to determining the logarithm of a matrix (Larimore, 1984, 1977). Orthonor¬ 
malization of the rows of the spectral factor gives J (w) while the normalizing terms form the diag¬ 
onal of D(w) . Similarly, L( w) can be taken as a spectral factor of a causal filter. 

Let X(w) be the random Fourier coefficients of x(r), ix. the spectral random measure of 
x(t) . Filtering x(r) with transfer function £.(«) gives the principal component process x(r) which 
is expressed in the frequency domain as X(u) * L( u)X(w) , and which has the diagonal spectral 
matrix £(<■») and similarly for q(i). 

Now consider the asymptotically equivalent expression (6) for the first term of the spectral 
measure (5) which is invariant to the unitary transformation J(w) and is thus given by 


1 V f A<«>- D «(«> f dcs 1 } I Dij(»)\ 2 du 

" 4 r i f +* 2 * J D„(«)D y /«) An 

35 i v l ' |Q|»I 2 do, 

4 it D g (m) ' An 2 * ^ fl u («)D yy («) An 


( 8 ) 


where the approximation holds for the diagonal elements of D(u>) near D(v>). The approximation 
is very useful when only the estimated spectrum J„(«) is known and we wish to consider the 
error in estimating the truth 5 w (u) . The first sum on the right hand side is the integrated 
squared relative error of the estimated cospectra of the principal components, while the second 
term is the integrated squared coherency of the estimated spectrum D(w) which would be zero if 
D(w)-D(«i) . Thus the first term of the measure (5) has a clear interpretation in the multivariate 
case when the true spectrum D(w) is diagonal but where the approximating spectrum D(w) is 
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permitted 

arbitrary coherency among components. The general case is reduced to this diagonal case by 
choosing an appropriate filter •/(<■>) which diagonalizes 5 w (w) as in (7). 

The second term in the spectral measure (5) is invariant to the unitary transformations 
y(w) and L(m) which gives 


- | /ir{D- l («XG<«) - G(u,)]£(«XG(«) - C(«)D^ 




(9) 


where G(u»)-J (ta)H («)L ’ («u) is the transfer function H( u) expresKd in the coordinate frame of 
the principal component series x(t) and y(t). The squared magnitude error | G y (») - G y («)| 2 in 
the ij element of the transfer function is weighted by the input signal to output noise ratio 
D u Ejj for the pair « J. 

4. A LOWER BOUND ON ACHIEVABLE SPECTRAL ACCURACY 

A lower bound on the expected negative entropy gives an asymptotic lower bound on the 
achievable accuracy in the estimation of the process spectrum and transfer function. This applies 
to the case of a fixed known model order as well as to the case of an unknown or infinite model 
order with the use of the AIC for model order selection. The achievable spectral accuracy is 
given as a function of the sample size and the number of parameters estimated. 

Consider the case where the model S is estimated using a K dimensional constrained estima¬ 
tor 8*. As derived in Larimore (1986) using the Cramer-Rao lower bound, the expected negen- 
tropy is asymptotically bounded by 


£ - /(5 ’ 5):s 2JV +£ - /(S ' J) 


( 10 ) 


where S is the model to which the constrained maximum likelihood estimate converges so that 
E„l(S J) is the bias in the constrained model and KITH is the sampling variability. 

The bound K/2N is achieved for an asymptotically unbiased and efficient estimator such as 
m axi m u m likelihood. In particular, this assumes that the true model order is no greater than the 
order K used in the model fitting. The true order IT of the process is usually not known and may 
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inf act be infinite, so that the above discussion is cot very useful in practice although it does give 
some insight into the accuracy issue. 

Consider now the Akaike minimum AJC procedure (MA1CE) using the estimator 9 A 
(Akaike, 1973). Assume that the true model order is infinite, so that for any subset of the infinite 
parameter vector, there exist nonzero components. Thus it is not possible to obtain asymptotically 
unbiased estimates of 9 using a fixed model order in estimating a model. Following Shibata 
(1983, 1981a, 1981b), define the optimal predictive order K‘(N) depending upoo the sample size 
N as the order K minimizing the negentropy (10). Then under suitable assumptions, the remark¬ 
able result is obtained by Shibata (1981) that asymptotically as 

(i) the lower bound using any order selection scheme is given by evaluating (10) at K=K*, and 

(ii) this lower bound is achieved by the MA1CE estimator 9 A . 

Thus the lower bound on the negentopy which is achieved by MA1CE is equal to the negentropy 
that would result from using an efficient estimator with apriori knowledge of the optimal order 
K m (N). 

Using' the spectral expression (5), an asymptotic lower bound on the expectation of the gen¬ 
eralized relative squared error in estimating the power spectrum is given by 

s T £ - / ,r{5 «( h, )( 5 «( fa> )- s w(“)]> 2 |7 

+ 7 JVtfV’l'/O*) - //(o»))s d m//(o») - *(«)f(ii) 

This gives a fundamental limit to the achievable accuracy in any parametric estimation pro¬ 
cedure. A further perspective on this fact is given by the justification of the expected negentropy 
as the natural measure of modeling approximation error in statistical inference. 
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